
Week 4: Representation Geometry

Overview

This week focuses on techniques for making the invisible visible. You will learn how to visualize high-dimensional activation vectors, understand the geometric structure of representations, and use visualization to discover interpretable patterns. From PCA projections to attention heatmaps, you will develop the visual intuition essential for mechanistic interpretability research.

Learning Objectives

By the end of this week, you should be able to:

Required Readings

Marks & Tegmark (2023). Truth and falsehood correspond to a linear direction in representation space.
Tigges et al. (2023). Clean case study showing sentiment is encoded linearly, with practical techniques for finding concept directions.
Feucht, Wallace & Bau (2025). A modern take on Mikolov's semantic arithmetic, identifying subspaces where parallelogram arithmetic works accurately.

Supplementary Readings

Mikolov et al. (2013). Historical origin of "linear directions encode concepts" with the famous king − man + woman = queen example.
Bolukbasi et al. (2016). Early template for manipulating concept directions—debiasing word embeddings.

Tutorial: Visualizing the Geometry of Language Model Representations

1. Linear Algebra Review

Before diving into visualization, let's review the mathematical tools we'll use.

Vectors and Operations

A vector is a list of numbers: v = [v₁, v₂, ..., vₙ]

Matrix-Vector Products

A matrix M transforms vectors: y = Mx
Each component of y is the dot product of one row of M with x

Singular Value Decomposition (SVD)

Any matrix M can be decomposed as M = UΣVᵀ, where U and V have orthonormal columns and Σ is diagonal with the non-negative singular values on its diagonal.

Principal Component Analysis (PCA)

PCA finds the directions of maximum variance in data:

  1. Center the data: subtract the mean
  2. Compute SVD of the centered data matrix
  3. The columns of V are the principal components
  4. Project data onto top k components for dimensionality reduction

Why PCA matters: It lets us visualize 768-dimensional activation vectors in 2D or 3D while preserving the most important structure.
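
A minimal numpy sketch of those four steps, assuming activations is an array of shape (n_samples, d_model):

import numpy as np

def pca_project(activations: np.ndarray, k: int = 2) -> np.ndarray:
    """Project row vectors onto their top-k principal components via SVD."""
    centered = activations - activations.mean(axis=0)         # 1. center the data
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)   # 2. SVD of the centered matrix
    components = Vt[:k]                                        # 3. top-k principal directions
    return centered @ components.T                             # 4. k-dimensional coordinates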

2. Visualizing Activation Vectors with PCA

Activation vectors live in high-dimensional space (typically 768-12,288 dimensions). PCA projects them into 2D or 3D while preserving as much variance as possible.

Example: Visualizing Word Embeddings

  1. Collect activation vectors for many words
  2. Apply PCA to find the top 2 principal components
  3. Plot each word at its projected coordinates
  4. Observe clusters: animals together, countries together, etc.
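
A sketch of this workflow, using GPT-2's input embeddings (via the transformers library) as a stand-in for activation vectors and scikit-learn's PCA, which performs the centering and SVD described above:

import matplotlib.pyplot as plt
import torch
from sklearn.decomposition import PCA
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

words = ["cat", "dog", "horse", "France", "Germany", "Japan", "run", "walk", "swim"]
vectors = []
for w in words:
    ids = tokenizer(" " + w, return_tensors="pt").input_ids
    with torch.no_grad():
        vectors.append(model.wte(ids)[0].mean(dim=0).numpy())  # mean-pool the word's token embeddings

coords = PCA(n_components=2).fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.title("Word embeddings projected onto the top 2 principal components")
plt.show()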

What to look for:

3. The Geometry of Truth

A remarkable finding: truth is represented as a linear direction in activation space, consistent across diverse contexts.

Key Results

Mass Mean-Difference Vectors

A simple but powerful technique:

  1. Collect activations for many examples with property A
  2. Collect activations for many examples without property A
  3. Compute: direction_A = mean(with_A) - mean(without_A)
  4. This direction points from the "without A" cluster toward the "with A" cluster and serves as a candidate linear representation of property A (see the sketch below)
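
A minimal numpy sketch, assuming acts_with_A and acts_without_A are arrays of shape (n_examples, d_model):

import numpy as np

def mass_mean_direction(acts_with_A: np.ndarray, acts_without_A: np.ndarray) -> np.ndarray:
    """Mass mean-difference: mean of the 'with A' activations minus mean of the 'without A' activations."""
    direction = acts_with_A.mean(axis=0) - acts_without_A.mean(axis=0)
    return direction / np.linalg.norm(direction)  # normalizing the length is optional but convenient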

Applications:

Euclidean Classifiers

Use the mass mean-difference vector as a classifier:

score = activation · direction_A
if score > threshold: predict "property A present"
else: predict "property A absent"

This simple linear classifier often works surprisingly well, supporting the linear representation hypothesis. (If the threshold is set at the midpoint between the scores of the two class means, this rule is exactly equivalent to assigning each activation to whichever class mean is closer in Euclidean distance, which is where the name comes from.)
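
A sketch of that classifier, continuing from the mass_mean_direction helper above; the midpoint threshold described in the previous paragraph is shown in the comments:

import numpy as np

def predict_property_A(activation: np.ndarray,
                       direction: np.ndarray,
                       threshold: float) -> bool:
    """Dot-product score against the concept direction, thresholded."""
    return float(activation @ direction) > threshold

# Midpoint threshold: halfway between the scores of the two class means.
# direction = mass_mean_direction(acts_with_A, acts_without_A)
# threshold = 0.5 * (acts_with_A.mean(0) @ direction + acts_without_A.mean(0) @ direction)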

4. The Geometry of Token Embeddings and Unembeddings

Token Embedding Matrix (Encoder)

Converts token IDs to vectors: E[token_id] → vector
Each row of E is a token's initial representation.

Unembedding Matrix (Decoder)

Converts final activations to vocabulary logits: logits = U × activation
Each row of U is a direction in activation space that "votes" for that token.
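
For concreteness, here is where these two matrices live in GPT-2 as served by the transformers library (other model families name them differently). GPT-2 ties its embedding and unembedding weights, which is the extreme case of the shared structure discussed below:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

E = model.transformer.wte.weight   # embedding matrix, shape (vocab_size, d_model)
U = model.lm_head.weight           # unembedding matrix, shape (vocab_size, d_model)

tok_id = tokenizer(" Paris").input_ids[0]
paris_embedding = E[tok_id]        # row lookup: the token's initial representation

final_activation = torch.randn(model.config.n_embd)  # stand-in for a real residual-stream vector
logits = final_activation @ U.T    # one dot product per vocabulary token

print(torch.equal(E, U))           # True: GPT-2 ties the two matrices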

Geometric Insight

Token embedding and unembedding matrices often share similar geometric structure:

5. Semantic-Vector Arithmetic

One of the most striking properties of neural language representations: you can do meaningful arithmetic with concept vectors.

Classic Examples

king - man + woman ≈ queen
Paris - France + Germany ≈ Berlin
walking - walk + swim ≈ swimming
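
You can try these with any static word embeddings; a sketch using the gensim library and its downloadable GloVe vectors (a large one-time download):

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-300")

# king - man + woman ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Paris - France + Germany ≈ ?
print(wv.most_similar(positive=["paris", "germany"], negative=["france"], topn=3))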

Why It Works

If concepts are linear directions, then:

Limitations

6. Visualizing Attention Patterns

Attention weights tell us which tokens each position is "looking at." Visualizing these patterns reveals interpretable structure.

Attention Heatmap

For each head, create a matrix where A[i,j] = attention from position i to position j
Plot as a heatmap with tokens on both axes.
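
A minimal sketch with GPT-2 and matplotlib; output_attentions=True returns, for each layer, a tensor of shape (batch, heads, seq, seq):

import matplotlib.pyplot as plt
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the mat because it was tired", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

layer, head = 5, 1                              # an arbitrary head to inspect
A = outputs.attentions[layer][0, head].numpy()  # A[i, j] = attention from position i to position j
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])

plt.imshow(A, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.xlabel("attended-to position (key)")
plt.ylabel("attending position (query)")
plt.colorbar()
plt.show()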

Common Patterns

7. Induction Heads: Visualizing Pattern-Copying

Now we dive deep into induction heads, one of the first interpretable circuits discovered in transformers.

What Induction Heads Do

Pattern copying: if the model has seen [A][B] earlier, then upon seeing [A] again, it predicts [B].

Input: "foo bar baz foo bar baz foo"
At final "foo", model predicts: "bar" (copying the pattern)

How They Work: Two-Head Cooperation

Induction requires two attention heads working together across layers:

Step 1: Previous-Token Head (earlier layer)

At "bar": attend to previous token → residual stream now encodes "previous token was foo"

Step 2: Induction Head (later layer)

Current: "foo"
Induction head: "Find where previous token was 'foo'"
→ Finds earlier "foo bar"
→ Attends to "bar" (what came after "foo")
→ Predicts "bar" for current position

Characteristic Attention Pattern

Induction heads have a distinctive "stripe" pattern in attention visualizations: on repeated text, each position attends back to the token that followed the previous occurrence of the current token, producing an off-diagonal stripe offset by the length of the repeat.
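
One way to find the stripe quantitatively, as a rough sketch: feed the model a repeated block of random tokens and measure how much each head attends from every position in the second half back to the token one step after that token's previous occurrence (GPT-2 via transformers is used as an example; your model's layer and head indices will differ):

import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

T = 32                                        # length of the repeated block
block = torch.randint(1000, 20000, (1, T))    # random token IDs
input_ids = torch.cat([block, block], dim=1)  # [A B C ... A B C ...]
with torch.no_grad():
    attentions = model(input_ids, output_attentions=True).attentions

# Induction stripe: position i in the second half should attend to position i - T + 1,
# the token that FOLLOWED the previous occurrence of the current token.
scores = {}
for layer, attn in enumerate(attentions):     # attn: (1, heads, 2T, 2T)
    for head in range(attn.shape[1]):
        stripe = [attn[0, head, i, i - T + 1] for i in range(T, 2 * T)]
        scores[(layer, head)] = torch.stack(stripe).mean().item()

# Heads with the highest average stripe attention are candidate induction heads.
for (layer, head), s in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"layer {layer}, head {head}: induction score {s:.2f}")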

Putting It All Together: A Visualization Toolkit

You now have multiple tools for understanding representations:

  1. PCA: Reduce dimensionality, visualize clusters and structure
  2. Mass mean-difference: Find concept directions
  3. Geometric analysis: Understand truth, token encodings/decodings
  4. Vector arithmetic: Test compositional structure
  5. Attention visualization: Understand information flow
  6. Induction head analysis: Identify pattern-copying mechanisms

These techniques combine to give you a comprehensive view of how language models represent and process information.

In-Class Exercise: Visualizing the Geometry of Puns

Building on our pun dataset from Week 3, we will visualize how the model represents puns versus non-puns and attempt to find a "pun direction" in activation space.

Code repository: davidbau.github.io/puns — Mechanisms of Pun Awareness in LLMs. Contains notebooks for exploring how models distinguish humorous vs. serious contexts, activation collection and analysis scripts, and visualization tools.

Part 1: Collecting Activations (15 min)

Using the provided notebook, load your pun dataset from Week 3 and extract activations:

  1. Load 20-30 puns and 20-30 non-pun sentences
  2. Run each through the model and extract the residual stream at multiple layers
  3. Focus on the final token position (where the punchline lands)
  4. Store activations for analysis
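
The course notebook handles this for you, but a rough sketch of the idea, assuming a transformers causal LM with enough layers for the indices used in Part 2 (gpt2-large here as a placeholder; substitute your Week 3 model and dataset):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large")

puns = ["I used to be a banker, but I lost interest."]     # replace with your 20-30 Week 3 puns
non_puns = ["I used to work at a bank in my home town."]   # replace with your 20-30 non-pun sentences

def final_token_activations(sentences, layers=(5, 15, 25)):
    """Residual-stream vectors at the final token of each sentence, per layer."""
    acts = {layer: [] for layer in layers}
    for s in sentences:
        inputs = tokenizer(s, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs, output_hidden_states=True).hidden_states
        for layer in layers:
            acts[layer].append(hidden[layer][0, -1])        # last token position
    return {layer: torch.stack(v) for layer, v in acts.items()}

pun_acts = final_token_activations(puns)
nonpun_acts = final_token_activations(non_puns)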

Questions to consider: Should we extract from the punchline token specifically? What about the setup? Does position matter for pun representations?

Part 2: PCA Visualization (20 min)

Apply PCA to visualize whether puns and non-puns separate in activation space:

  1. Layer comparison: Create PCA plots for early (layer 5), middle (layer 15), and late (layer 25) layers
  2. Color by category: Puns in one color, non-puns in another
  3. Examine structure: Do they cluster? Is there overlap? Linear separation?
  4. Try different positions: Final token vs. middle of sentence
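
A sketch of steps 1 and 2, reusing pun_acts and nonpun_acts from Part 1 (scikit-learn's PCA stands in for the hand-rolled version from the tutorial):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

for i, layer in enumerate((5, 15, 25)):
    X = np.concatenate([pun_acts[layer].numpy(), nonpun_acts[layer].numpy()])
    coords = PCA(n_components=2).fit_transform(X)
    n = len(pun_acts[layer])
    plt.subplot(1, 3, i + 1)
    plt.scatter(coords[:n, 0], coords[:n, 1], label="puns")
    plt.scatter(coords[n:, 0], coords[n:, 1], label="non-puns")
    plt.title(f"layer {layer}")
plt.legend()
plt.show()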

Discussion: At which layer do puns most clearly separate from non-puns? Is the separation clean or fuzzy? What might this tell us about how the model processes humor?

Part 3: Finding the "Pun Direction" (25 min)

Compute a mass mean-difference vector to find the "pun direction":

  1. Compute means:
    • mean_pun = average(activations for pun sentences)
    • mean_nonpun = average(activations for non-pun sentences)
  2. Difference vector: pun_direction = mean_pun - mean_nonpun
  3. Test the direction:
    • Project held-out examples onto pun_direction
    • Do puns have higher scores than non-puns?
    • What is the classification accuracy using this simple linear classifier?
  4. Compare layers: Which layer gives the best pun direction?
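
A sketch of steps 1-3, again reusing the per-layer activation tensors from Part 1 and holding out everything after the first n_train examples of each class (assumes 20-30 examples per class, as collected in Part 1):

import torch

def pun_direction_accuracy(pun_acts, nonpun_acts, n_train=15):
    """Fit a mass mean-difference direction on a train split, then classify the held-out examples."""
    direction = pun_acts[:n_train].mean(0) - nonpun_acts[:n_train].mean(0)
    threshold = 0.5 * (pun_acts[:n_train].mean(0) @ direction +
                       nonpun_acts[:n_train].mean(0) @ direction)
    correct = ((pun_acts[n_train:] @ direction) > threshold).sum()
    correct = correct + ((nonpun_acts[n_train:] @ direction) <= threshold).sum()
    total = (len(pun_acts) - n_train) + (len(nonpun_acts) - n_train)
    return correct.item() / total

for layer in (5, 15, 25):
    acc = pun_direction_accuracy(pun_acts[layer], nonpun_acts[layer])
    print(f"layer {layer}: held-out accuracy {acc:.2f}")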

Extension: Try the same analysis for different types of puns (wordplay vs. situation comedy). Do they have different directions?

Open Pun Geometry Notebook in Colab

Code Exercise

This week's exercise provides hands-on practice with visualization techniques:

Open Exercise in Colab

Project Milestone

Due: Thursday of Week 4

Now that you have selected your model and built a benchmark, dive deep into how the model represents your concept internally. Use visualization techniques to examine geometric structure across layers and token positions.

Geometric Structure Analysis

Deliverables:

This analysis will guide your next steps: if you find a strong linear representation at specific layers/positions, that's where you'll focus causal interventions (Week 4) and probe training (Week 5).