
Week 1: Foundations

Overview

How can we peer inside a running language model to see what it's "thinking"? The key insight is that transformer intermediate layers encode evolving predictions that we can decode and inspect. This week introduces the conceptual vocabulary and core techniques for mechanistic interpretability, including the logit lens—a simple but powerful idea that opened a window into the progressive refinement of representations through the network.

Learning Objectives

By the end of this week, you should be able to:

Required Readings

Ferrando, Sarti, Bisazza & Costa-jussà (2024). Accessible pedagogical overview of interpretability techniques, establishing shared vocabulary for the course.
nostalgebraist (2020). The foundational blog post introducing logit lens—decoding intermediate layer activations directly into vocabulary space.
Wendler et al. (2024). Applies logit lens to multilingual models, revealing that models often pivot through English in intermediate representations.

Supplementary Readings

Belrose et al. (2023). Refined version of logit lens with learned probes for more accurate intermediate decoding.
Elhage et al. (2021). Deeper mathematical treatment of the residual stream view. Dense but foundational.
Professor Bryce. Visual explanation of transformer architecture. Helpful background for understanding the models we're interpreting.
Yocum, Olshausen & Weiss (2025). Extends linear representations from scalar features to function-valued "feature fields" over topological domains (e.g., positions, directions). Shows that eigendecomposition of a learned domain embedding determines which functions can be linearly probed. Advanced/theoretical.

Tutorial: Understanding Transformer Internals

1. The Transformer as a Grid of States

A transformer language model processes text through three phases, creating a grid of internal states:

[Figure: the transformer state grid, showing the three phases: encoder, layers, decoder]
  1. Encoder: First, the encoder turns each input token into a vector of neural activations. The word "the" becomes a high-dimensional vector.
  2. Layers: Then a series of neural layers mixes and transforms the vectors for each token. Information flows both along each token's "column" and across tokens via attention.
  3. Decoder: Finally, the decoder turns each vector into a prediction for the next word. After "Miles Davis plays the", the model predicts "trumpet".
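
If you want to see the three phases in code, here is a minimal sketch using the HuggingFace transformers library with GPT-2. The model choice and variable names are illustrative assumptions, not the course's NDIF tooling, and a model this small may well not produce "trumpet".

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")            # small model, for illustration only
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    # Encoder phase: text -> token ids (the embedding into vectors happens inside the model).
    inputs = tok("Miles Davis plays the", return_tensors="pt")

    # Layers + decoder phase: the forward pass mixes information across positions,
    # then the unembedding turns the last state into next-token logits.
    with torch.no_grad():
        logits = model(**inputs).logits

    next_id = int(logits[0, -1].argmax())
    print(tok.decode(next_id))                             # a model this small may not say " trumpet"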

2. The Residual Stream View

A transformer can be understood as a series of operations that read from and write to a shared "residual stream." At each layer, attention heads and MLP modules add information to this stream rather than replacing it entirely.

x_0 = embed(token)
x_1 = x_0 + attn_1(x_0) + mlp_1(x_0)
x_2 = x_1 + attn_2(x_1) + mlp_2(x_1)
...
output = unembed(x_final)

(Schematic: in most architectures the MLP actually reads the stream after attention has written to it, but the additive picture is the same.)

This view enables a key insight: we can peek at intermediate states x_L to see what the model "knows" at layer L.
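
Here is a minimal sketch of that peek, reusing the tok and model objects from the sketch above: output_hidden_states=True returns the embedding output x_0 followed by one residual-stream state per layer.

    # Reusing tok and model from the sketch above.
    with torch.no_grad():
        out = model(**tok("Miles Davis plays the", return_tensors="pt"),
                    output_hidden_states=True)

    # out.hidden_states is a tuple: x_0 (the embedding output) followed by one
    # residual-stream state per layer, each of shape (batch, seq_len, d_model).
    for L, x_L in enumerate(out.hidden_states):
        print(L, tuple(x_L.shape))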

3. The Logit Lens: Early Exit Decoding

What if we don't wait until the final layer to decode? The logit lens applies the decoder (unembedding matrix) directly to intermediate activations:

[Figure: early-exit decoding of "Miles Davis plays the ___", showing how the prediction evolves across layers: thing → jazz → horn → horn → trumpet]

If you decode each vector early, you can see how the prediction evolves. In this example, the model's prediction for "Miles Davis plays the ___" progresses through: "thing" → "jazz" → "horn" → "horn" → "trumpet". The correct answer emerges gradually as information is refined through the layers.

intermediate_logits = W_U · x_L
probs = softmax(intermediate_logits)
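
Putting the two previous sketches together, here is a rough logit lens helper, again assuming the illustrative HuggingFace GPT-2 setup (model.transformer.ln_f and model.lm_head are GPT-2's names for the final layer norm and unembedding; other architectures name these modules differently). One practical detail the formula above omits: logit lens implementations usually pass x_L through the model's final layer norm before unembedding, which is what the helper does.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")        # same illustrative setup as above
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def logit_lens(prompt, top_k=3):
        # Decode every residual-stream state at the final token position
        # through the final layer norm and the unembedding matrix W_U.
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
            rows = []
            for layer, h in enumerate(out.hidden_states):
                x = h[0, -1]                                       # x_L at the last position
                if layer < len(out.hidden_states) - 1:
                    x = model.transformer.ln_f(x)                  # last entry already includes ln_f
                probs = torch.softmax(model.lm_head(x), dim=-1)    # softmax(W_U · x_L)
                top = torch.topk(probs, top_k)
                rows.append((layer, [tok.decode(int(i)) for i in top.indices]))
        return rows

    for layer, tokens in logit_lens("Miles Davis plays the"):
        print(layer, tokens)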

Key observations from logit lens analysis:

4. What Does Latent Language Reveal?

When applying logit lens to multilingual models processing non-English text, a striking pattern emerges: regardless of input language, intermediate representations often decode to English tokens.

Wendler et al. (2024) studied this systematically with translation tasks. When prompting a model to translate French to Chinese (e.g., Français: "fleur" - 中文:), the intermediate layers decode to... English! The model appears to pivot through English internally, even when English is neither the input nor output language.
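
A rough way to script that measurement, reusing the logit_lens helper sketched above, is to classify each layer's top token as CJK or not with a crude character test. Caveats: the character test is only a proxy for "Chinese", GPT-2 is an English-only model that will not reproduce the finding, and the multilingual models in the workbench exercise below are what you would actually run this on.

    # Few-shot French -> Chinese translation prompt (the last answer is left blank).
    prompt = ('Français: "cinq" - 中文: "五"\n'
              'Français: "coeur" - 中文: "心"\n'
              'Français: "trois" - 中文: "三"\n'
              'Français: "nuage" - 中文: "')

    def is_cjk(s):
        # Crude proxy: does the decoded token contain any CJK character?
        return any("\u4e00" <= ch <= "\u9fff" for ch in s)

    for layer, tokens in logit_lens(prompt, top_k=1):
        tag = "CJK" if is_cjk(tokens[0]) else "latin/other"
        print(f"layer {layer:2d}  top token {tokens[0]!r}  ({tag})")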

Possible implications:

But what does this actually show? The observation that intermediate layers decode to English is striking, but its interpretation is far from settled. Consider: the unembedding matrix was trained predominantly on English data—perhaps it simply projects any high-dimensional vector more readily onto English tokens. Or perhaps the pattern reflects tokenization artifacts rather than genuine linguistic processing. And even if the model does "pivot through English," we have no causal evidence that these intermediate English representations are actually used in the computation, rather than being epiphenomenal byproducts. We'll return to experiments along these lines later in the course; for now, it is worth debating what this observation really proves.

5. Variations on Logit Lens

The basic logit lens assumes we can directly project intermediate activations through the unembedding matrix. But representations at early layers may not be in the same "format" as final-layer representations, and researchers have developed many variations:

Each variation asks a slightly different question about what information is present in intermediate representations. The proliferation of "lenses" reflects both the fertility of the basic idea and the difficulty of interpreting what these projections really mean.
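
To make the "different format" point concrete, here is a rough sketch in the spirit of the tuned lens (Belrose et al., 2023): instead of projecting x_L directly, train a small affine "translator" for that layer so that its output, pushed through the usual layer norm and unembedding, matches the model's final-layer distribution. The layer choice, optimizer, and single-prompt training step below are illustrative assumptions, not the authors' exact recipe.

    import torch
    import torch.nn.functional as F

    # Reuses tok and model from the earlier sketches; freeze the model itself.
    for p in model.parameters():
        p.requires_grad_(False)

    LAYER = 6                                        # illustrative layer choice
    d_model = model.config.hidden_size
    translator = torch.nn.Linear(d_model, d_model)   # a learned affine map for this layer
    opt = torch.optim.Adam(translator.parameters(), lr=1e-3)

    def train_step(prompt):
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        target = F.log_softmax(out.logits[0, -1], dim=-1)            # final-layer distribution
        h = out.hidden_states[LAYER][0, -1]                          # x_L at the last position
        pred = model.lm_head(model.transformer.ln_f(translator(h)))  # translate, then unembed
        loss = F.kl_div(F.log_softmax(pred, dim=-1), target,
                        reduction="sum", log_target=True)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    # Real training uses many prompts and positions; one step only shows the objective.
    print(train_step("Miles Davis plays the"))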

6. From Observation to Understanding: What Experiments Are Needed?

Logit lens generates striking observations. But observations are not explanations. The central challenge is: what additional experiments would you need to turn a logit lens observation into a scientific claim?

Class Discussion: Designing Follow-up Experiments

Consider the Wendler observation: intermediate layers decode to English during French→Chinese translation.

  1. Hypothesis: "The model uses English as an internal lingua franca."
  2. Alternative hypothesis: "The decoder matrix is biased toward English tokens."
  3. Another alternative: "This is a tokenization artifact."

Your task: For each alternative, design an experiment that would distinguish it from the main hypothesis. What would you measure? What controls would you need? What result would be convincing?
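
As one possible starting point for alternative 2, and only as a sketch rather than a validated control: if the unembedding itself were biased toward English, then even random vectors with a realistic norm should tend to decode to English-looking tokens. Comparing that baseline rate against the rate for real intermediate states is one crude check; the "looks English" test below is just an ASCII-letters proxy and is exactly the kind of thing your design would need to improve.

    import torch

    # Reuses tok and model from the tutorial sketches.
    def looks_english(s):
        s = s.strip()
        return s.isascii() and s.isalpha() and len(s) > 1   # crude proxy for "an English word"

    with torch.no_grad():
        out = model(**tok("Miles Davis plays the", return_tensors="pt"),
                    output_hidden_states=True)
        typical_norm = out.hidden_states[6][0, -1].norm()    # norm of a mid-layer state

        hits, n = 0, 200
        for _ in range(n):
            v = torch.randn(model.config.hidden_size)
            v = v / v.norm() * typical_norm                  # random direction, realistic norm
            top = int(model.lm_head(model.transformer.ln_f(v)).argmax())
            hits += looks_english(tok.decode(top))

    print(f"random-vector 'English' rate: {hits / n:.2f}")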

Similar questions arise for any logit lens observation:

We'll develop tools to address these questions throughout the course: causal interventions (Week 5), systematic evaluation methodology (Week 3), and validation techniques (Weeks 6-7). For now, the goal is to recognize the gap between observation and understanding—and to start thinking about how to bridge it.

In-Class Exercise: Logit Lens Workbench

In this hands-on exercise, we'll use the NDIF Logit Lens Workbench—a code-free tool for exploring transformer internals. No programming required!

Open NDIF Logit Lens Workbench

Part 1: Reproduce the Wendler Results

First, we'll replicate the "latent language" finding from Wendler et al. (2024). Does the model really pivot through English when translating between non-English languages?

Instructions:

  1. Open the workbench and select a multilingual model (e.g., Llama-2)
  2. Enter a French-to-Chinese translation prompt:
    Français: "cinq" - 中文: "五"
    Français: "coeur" - 中文: "心"
    Français: "trois" - 中文: "三"
    Français: "nuage" - 中文:
  3. Examine the logit lens heatmap for the final token position
  4. At which layers do you see English words appearing? At which layers does Chinese emerge?

Discussion Questions:

Part 2: Exploring a Research Question—Do Models Understand Puns?

Now let's apply logit lens to investigate a research question we'll revisit throughout the course: how do language models process puns?

Puns are interesting because they require recognizing that a word has multiple meanings simultaneously. If a model "gets" a pun, we might expect to see both meanings active in its intermediate representations.

Example Puns to Explore:

"I used to be a banker, but I lost interest."

"The past, present, and future walked into a bar. It was tense."

"I'm reading a book about anti-gravity. It's impossible to put down."

Instructions:

  1. Enter a pun-setup prompt that leads to a punchline word
  2. Look at the logit lens output for the position just before the punchline
  3. Do you see evidence of both meanings of the pun word?
  4. At what layer does the "correct" (pun) completion emerge?
  5. Compare with a non-pun sentence using the same words—does the pattern differ?

In-Context Learning of Puns

Here's a more sophisticated experiment: can we use in-context learning to teach the model the "pun pattern"?

Try comparing the prediction for an incomplete pun alone versus prefixed with other puns:

Without context:
"I used to be a banker, but I lost ___"
(Model might predict: "my job", "everything", "money")

With pun context:
"I used to be a tailor, but I wasn't suited for it.
I used to be a train conductor, but I switched tracks.
I used to be a banker, but I lost ___"
(Model might now predict: "interest")

Use logit lens to examine: at which layers does "interest" emerge? Does the pun context change what appears in intermediate layers, or only at the final layers? This connects back to the function vectors idea from Week 0—is there a "pun function" that gets activated by the context?
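
One way to script the comparison, reusing the tok and model setup from the tutorial sketches: track the probability of the pun completion at every layer, with and without the pun context. The probe " interest" is assumed here to be a single GPT-2 token, and a small model may never rank it highly.

    # Reuses torch, tok, and model from the tutorial sketches.
    bare = "I used to be a banker, but I lost"
    in_context = ("I used to be a tailor, but I wasn't suited for it.\n"
                  "I used to be a train conductor, but I switched tracks.\n"
                  "I used to be a banker, but I lost")

    probe_id = tok.encode(" interest")[0]   # assumed single token; verify with tok.encode(" interest")

    def probe_prob(prompt):
        # Probability assigned to the pun completion at every layer (logit lens style).
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
            probs = []
            for layer, h in enumerate(out.hidden_states):
                x = h[0, -1]
                if layer < len(out.hidden_states) - 1:
                    x = model.transformer.ln_f(x)
                probs.append(torch.softmax(model.lm_head(x), dim=-1)[probe_id].item())
        return probs

    for name, prompt in [("without context", bare), ("with pun context", in_context)]:
        print(name, [f"{p:.3f}" for p in probe_prob(prompt)])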

Comparing Model Sizes

Try the same pun experiments on models of different sizes. At what scale does pun understanding emerge? A small model might never predict "interest" even with pun context, while a larger model might get it immediately. Use this to explore: how large does a model need to be to "get" puns? Does the pattern in intermediate layers look different in models that succeed versus those that fail?
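
A rough way to sweep sizes, using the GPT-2 family purely as an example (the workbench offers other model families, and larger checkpoints take a while to download):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    prompt = ("I used to be a tailor, but I wasn't suited for it.\n"
              "I used to be a train conductor, but I switched tracks.\n"
              "I used to be a banker, but I lost")

    for name in ["gpt2", "gpt2-medium", "gpt2-large"]:   # illustrative sizes
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)
        model.eval()
        with torch.no_grad():
            logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
        print(f"{name:12s} top prediction: {tok.decode(int(logits.argmax()))!r}")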

Using SOTA Models as a Reference

State-of-the-art models like ChatGPT or Claude can help in two ways:

Research Questions to Consider:

We'll return to the pun research question throughout the semester as we learn new interpretability methods. Each technique will give us a different lens on the same phenomenon.

Optional: Code Exercise

For those who want to dive deeper into the implementation, this exercise provides hands-on coding experience:

Open Logit Lens Exercise in Colab

Project Milestone

Due: Thursday of Week 1

Deliverable: Initial Pitch

Thursday In-Class Activity

Each team will present their five-minute elevator pitch. As a class, we will read each other's proposals and discuss them using the FINER framework:

Next Assignment (Due: Week 2)

After receiving feedback, begin exploratory analysis to test the "F" in FINER: Is this feasible? Are there signs of life?

Deliverable for Week 2: Start a running project slide deck in your team's Google Drive. Include your preliminary observations and be ready to present them in class.