This week focuses on techniques for making the invisible visible. You will learn how to visualize high-dimensional activation vectors, understand the geometric structure of representations, and use visualization to discover interpretable patterns. From PCA projections to attention heatmaps, you will develop the visual intuition essential for mechanistic interpretability research.
By the end of this week, you should be able to:
Before diving into visualization, let's review the mathematical tools we'll use.
A vector is a list of numbers: v = [v₁, v₂, ..., vₙ]
v · w = v₁w₁ + v₂w₂ + ... + vₙwₙ (measures similarity/projection)
||v|| = √(v · v) (the length of v)
cos(θ) = (v · w) / (||v|| ||w||) (cosine similarity between v and w)
A matrix M transforms vectors: y = Mx
Each row of M computes one dot product with x
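A quick NumPy sketch of these operations (the vectors and the 2×3 matrix below are made-up examples):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

dot = v @ w                        # v·w = 1*4 + 2*5 + 3*6 = 32
norm_v = np.sqrt(v @ v)            # ||v||, same as np.linalg.norm(v)
cos_theta = dot / (np.linalg.norm(v) * np.linalg.norm(w))  # cosine similarity

M = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])    # 2x3 matrix: maps R^3 -> R^2
y = M @ v                          # each entry of y is one row of M dotted with v
assert np.allclose(y, [M[0] @ v, M[1] @ v])
```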
Any matrix M can be decomposed as: M = UΣVᵀ
U: Left singular vectors (output space basis)
Σ: Singular values (scaling factors, diagonal matrix)
V: Right singular vectors (input space basis)
PCA finds the directions of maximum variance in data:
Apply SVD to the mean-centered data matrix: the right singular vectors V are the principal components.
Why PCA matters: It lets us visualize 768-dimensional activation vectors in 2D or 3D while preserving the most important structure.
Activation vectors live in high-dimensional space (typically 768-12,288 dimensions). PCA projects them into 2D or 3D while preserving as much variance as possible.
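A minimal sketch using scikit-learn's PCA; the random array here is a placeholder standing in for a real (n_examples, d_model) batch of activations:

```python
import numpy as np
from sklearn.decomposition import PCA

activations = np.random.randn(200, 768)      # placeholder: (n_examples, d_model) activations from one layer

pca = PCA(n_components=2)
projected = pca.fit_transform(activations)   # (n_examples, 2) coordinates, ready to scatter-plot

# Fraction of the total variance the 2D projection preserves
print(pca.explained_variance_ratio_.sum())
```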
What to look for:
A remarkable finding: truth is represented as a linear direction in activation space, consistent across diverse contexts.
A simple but powerful technique:
direction_A = mean(with_A) - mean(without_A)
Applications:
Use the mass mean-difference vector as a classifier:
This simple linear classifier often works surprisingly well, supporting the linear representation hypothesis.
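A minimal NumPy sketch of this classifier (the activation arrays and labels are placeholders; the midpoint threshold is one reasonable choice, not the only one):

```python
import numpy as np

def fit_mean_difference(acts, labels):
    """direction = mean(with concept) - mean(without); threshold = midpoint of the two projected means."""
    mean_pos = acts[labels == 1].mean(axis=0)
    mean_neg = acts[labels == 0].mean(axis=0)
    direction = mean_pos - mean_neg
    threshold = 0.5 * (mean_pos + mean_neg) @ direction
    return direction, threshold

def predict(acts, direction, threshold):
    return (acts @ direction > threshold).astype(int)

# Placeholder data; replace with real activations and concept labels
acts = np.random.randn(100, 768)
labels = np.random.randint(0, 2, size=100)
direction, threshold = fit_mean_difference(acts, labels)
accuracy = (predict(acts, direction, threshold) == labels).mean()
```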
The embedding matrix E converts token IDs to vectors: E[token_id] → vector
Each row of E is a token's initial representation.
The unembedding matrix U converts final activations to vocabulary logits: logits = U × activation
Each row of U is a direction in activation space that "votes" for that token.
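Concretely, for GPT-2 loaded through Hugging Face transformers (an assumption; GPT-2 ties the embedding and unembedding to the same weight matrix, so E and U below share storage):

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

E = model.transformer.wte.weight     # (vocab_size, d_model): one row per token
U = model.lm_head.weight             # (vocab_size, d_model): unembedding, tied to E in GPT-2

vec = E[1000]                        # initial representation of token id 1000

activation = torch.randn(E.shape[1]) # placeholder final residual-stream vector
logits = U @ activation              # each row of U "votes" for its token
```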
Token embedding and unembedding matrices often share similar geometric structure:
E[i] · U[j]ᵀ measures how much token i "predicts" token j.
One of the most striking properties of neural language representations: you can do meaningful arithmetic with concept vectors.
If concepts are linear directions, then:
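The classic example is king - man + woman ≈ queen. A minimal sketch using pretrained GloVe word vectors through gensim (both are assumptions here, not part of the course materials; transformer token embeddings show the same effect, just more noisily):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # downloads pretrained GloVe vectors on first use

# king - man + woman -> nearest neighbors should include "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```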
Attention weights tell us which tokens each position is "looking at." Visualizing these patterns reveals interpretable structure.
For each head, create a matrix where A[i,j] = attention from position i to position j
Plot as a heatmap with tokens on both axes.
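A minimal plotting sketch with GPT-2 and matplotlib (the model, layer, and head choices are arbitrary assumptions; any model that returns attention weights works the same way):

```python
import torch
import matplotlib.pyplot as plt
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True)

text = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

layer, head = 5, 1
A = outputs.attentions[layer][0, head]              # (seq_len, seq_len): A[i, j] = attention from i to j
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

plt.imshow(A, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.colorbar(label="attention weight")
plt.show()
```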
Now we dive deep into induction heads, one of the first interpretable circuits discovered in transformers.
Pattern copying: if the model has seen [A][B] earlier, then upon seeing [A] again, it predicts [B].
Input: "foo bar baz foo bar baz foo"
At final "foo", model predicts: "bar" (copying the pattern)
Induction requires two attention heads working together across layers:
Step 1: Previous-Token Head (earlier layer) attends from each position to the immediately preceding token and copies that token's identity into the current position's representation.
Step 2: Induction Head (later layer) looks for an earlier position whose previous-token information matches the current token, attends to the token that followed it, and predicts that token as the next output.
Induction heads have a distinctive "stripe" pattern in attention visualization: an off-diagonal stripe, offset by the length of the repeated segment, where each token attends to the token that followed its previous occurrence.
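One way to hunt for such heads is to feed the model a repeated random-token sequence and score each head on how much attention it pays from each second-half token back to the position just after that token's first occurrence. A sketch under the assumptions of GPT-2 and this simple scoring heuristic:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True)

seq_len = 50
rand = torch.randint(0, 50257, (1, seq_len))
tokens = torch.cat([rand, rand], dim=1)                   # [A B C ... A B C ...]

with torch.no_grad():
    attentions = model(tokens).attentions                 # tuple of (1, n_heads, 2*seq_len, 2*seq_len)

# Induction score: attention from position i (second repeat) to position i - seq_len + 1
scores = {}
queries = torch.arange(seq_len + 1, 2 * seq_len)
for layer, attn in enumerate(attentions):
    for head in range(attn.shape[1]):
        A = attn[0, head]
        scores[(layer, head)] = A[queries, queries - seq_len + 1].mean().item()

top = sorted(scores.items(), key=lambda kv: -kv[1])[:5]
print("Candidate induction heads (layer, head) and scores:", top)
```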
You now have multiple tools for understanding representations:
These techniques combine to give you a comprehensive view of how language models represent and process information.
Building on our pun dataset from Week 3, we will visualize how the model represents puns versus non-puns and attempt to find a "pun direction" in activation space.
Code repository: davidbau.github.io/puns — Mechanisms of Pun Awareness in LLMs. Contains notebooks for exploring how models distinguish humorous vs. serious contexts, activation collection and analysis scripts, and visualization tools.
Using the provided notebook, load your pun dataset from Week 3 and extract activations:
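A minimal extraction sketch with Hugging Face transformers (the GPT-2 model, the layer index, and the last-token position are assumptions; swap in whatever the provided notebook uses, and replace the placeholder sentences with your Week 3 data):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

def get_activations(sentences, layer=6, position=-1):
    """Return a (n_sentences, d_model) tensor of hidden states at one layer and token position."""
    acts = []
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).hidden_states[layer]   # (1, seq_len, d_model)
        acts.append(hidden[0, position])                    # e.g. the last token's vector
    return torch.stack(acts)

# Placeholder sentences; replace with the pun / non-pun sentences from your Week 3 dataset
pun_sentences = ["I used to be a banker, but I lost interest."]
nonpun_sentences = ["I used to be a banker, but I changed careers."]

pun_acts = get_activations(pun_sentences)
nonpun_acts = get_activations(nonpun_sentences)
```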
Questions to consider: Should we extract from the punchline token specifically? What about the setup? Does position matter for pun representations?
Apply PCA to visualize whether puns and non-puns separate in activation space:
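A plotting sketch that compares several layers side by side (the layer choices and the random placeholder arrays are assumptions; substitute the per-layer activations you extracted above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

layers_to_plot = [2, 6, 10]
# Placeholders: replace with real per-layer activation arrays of shape (n_examples, d_model)
pun_acts = {l: np.random.randn(50, 768) for l in layers_to_plot}
nonpun_acts = {l: np.random.randn(50, 768) for l in layers_to_plot}

fig, axes = plt.subplots(1, len(layers_to_plot), figsize=(12, 4))
for ax, layer in zip(axes, layers_to_plot):
    X = np.vstack([pun_acts[layer], nonpun_acts[layer]])
    labels = np.array([1] * len(pun_acts[layer]) + [0] * len(nonpun_acts[layer]))
    proj = PCA(n_components=2).fit_transform(X)
    ax.scatter(proj[:, 0], proj[:, 1], c=labels, cmap="coolwarm", s=10)
    ax.set_title(f"Layer {layer}: puns (red) vs. non-puns (blue)")
plt.tight_layout()
plt.show()
```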
Discussion: At which layer do puns most clearly separate from non-puns? Is the separation clean or fuzzy? What might this tell us about how the model processes humor?
Compute a mass mean-difference vector to find the "pun direction":
mean_pun = average(activations for pun sentences)
mean_nonpun = average(activations for non-pun sentences)
pun_direction = mean_pun - mean_nonpun
Extension: Try the same analysis for different types of puns (wordplay vs. situation comedy). Do they have different directions?
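A sketch of the pun-direction computation with a simple train/test split, so you can check whether the direction generalizes to held-out sentences (the placeholder arrays and the 80/20 split are assumptions):

```python
import numpy as np

# Placeholders: replace with the pun / non-pun activations from one layer
pun_acts = np.random.randn(60, 768)
nonpun_acts = np.random.randn(60, 768)

def split(acts, frac=0.8):
    n = int(len(acts) * frac)
    return acts[:n], acts[n:]

pun_train, pun_test = split(pun_acts)
nonpun_train, nonpun_test = split(nonpun_acts)

pun_direction = pun_train.mean(axis=0) - nonpun_train.mean(axis=0)
threshold = 0.5 * (pun_train.mean(axis=0) + nonpun_train.mean(axis=0)) @ pun_direction

# Held-out check: puns should project above the threshold, non-puns below
correct = np.concatenate([
    pun_test @ pun_direction > threshold,
    nonpun_test @ pun_direction <= threshold,
])
print(f"Held-out accuracy along the pun direction: {correct.mean():.2f}")
```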
Open Pun Geometry Notebook in Colab
This week's exercise provides hands-on practice with visualization techniques:
Due: Thursday of Week 4
Now that you have selected your model and built a benchmark, dive deep into how the model represents your concept internally. Use visualization techniques to examine geometric structure across layers and token positions.
This analysis will guide your next steps: if you find a strong linear representation at specific layers/positions, that's where you'll focus causal interventions (Week 4) and probe training (Week 5).