This week focuses on techniques for making the invisible visible. You will learn how to visualize high-dimensional activation vectors, understand the geometric structure of representations, and use visualization to discover interpretable patterns. From PCA projections to attention heatmaps, you will develop the visual intuition essential for mechanistic interpretability research.
By the end of this week, you should be able to:
Before diving into visualization, let's review the mathematical tools we'll use.
A vector is a list of numbers: v = [v₁, v₂, ..., vₙ]
v · w = v₁w₁ + v₂w₂ + ... + vₙwₙ (measures similarity/projection)
||v|| = √(v · v) (the length of v)
cos(θ) = (v · w) / (||v|| ||w||) (cosine similarity between v and w)
A matrix M transforms vectors: y = Mx
Each row of M computes one dot product with x
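A quick NumPy sketch of these operations (the vectors and the 2×3 matrix below are made-up examples):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

dot = v @ w                        # v·w = 1*4 + 2*5 + 3*6 = 32
norm_v = np.sqrt(v @ v)            # ||v||, same as np.linalg.norm(v)
cos_theta = dot / (np.linalg.norm(v) * np.linalg.norm(w))  # cosine similarity

M = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])    # 2x3 matrix: maps R^3 -> R^2
y = M @ v                          # each entry of y is one row of M dotted with v
assert np.allclose(y, [M[0] @ v, M[1] @ v])
```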
Any matrix M can be decomposed as: M = UΣVᵀ
U: Left singular vectors (output space basis)
Σ: Singular values (scaling factors, diagonal matrix)
V: Right singular vectors (input space basis)
PCA finds the directions of maximum variance in data:
Apply SVD to the mean-centered data matrix: the right singular vectors V are the principal components.
Why PCA matters: It lets us visualize 768-dimensional activation vectors in 2D or 3D while preserving the most important structure.
Activation vectors live in high-dimensional space (typically 768-12,288 dimensions). PCA projects them into 2D or 3D while preserving as much variance as possible.
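A minimal sketch using scikit-learn's PCA; the random array here is a placeholder standing in for a real (n_examples, d_model) batch of activations:

```python
import numpy as np
from sklearn.decomposition import PCA

activations = np.random.randn(200, 768)      # placeholder: (n_examples, d_model) activations from one layer

pca = PCA(n_components=2)
projected = pca.fit_transform(activations)   # (n_examples, 2) coordinates, ready to scatter-plot

# Fraction of the total variance the 2D projection preserves
print(pca.explained_variance_ratio_.sum())
```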
What to look for:
A remarkable finding: truth is represented as a linear direction in activation space, consistent across diverse contexts.
A simple but powerful technique:
direction_A = mean(with_A) - mean(without_A)
Applications:
Use the mass mean-difference vector as a classifier:
This simple linear classifier often works surprisingly well, supporting the linear representation hypothesis.
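A minimal NumPy sketch of this classifier (the activation arrays and labels are placeholders; the midpoint threshold is one reasonable choice, not the only one):

```python
import numpy as np

def fit_mean_difference(acts, labels):
    """direction = mean(with concept) - mean(without); threshold = midpoint of the two projected means."""
    mean_pos = acts[labels == 1].mean(axis=0)
    mean_neg = acts[labels == 0].mean(axis=0)
    direction = mean_pos - mean_neg
    threshold = 0.5 * (mean_pos + mean_neg) @ direction
    return direction, threshold

def predict(acts, direction, threshold):
    return (acts @ direction > threshold).astype(int)

# Placeholder data; replace with real activations and concept labels
acts = np.random.randn(100, 768)
labels = np.random.randint(0, 2, size=100)
direction, threshold = fit_mean_difference(acts, labels)
accuracy = (predict(acts, direction, threshold) == labels).mean()
```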
The embedding matrix E converts token IDs to vectors: E[token_id] → vector
Each row of E is a token's initial representation.
The unembedding matrix U converts final activations to vocabulary logits: logits = U × activation
Each row of U is a direction in activation space that "votes" for that token.
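Concretely, for GPT-2 loaded through Hugging Face transformers (an assumption; GPT-2 ties the embedding and unembedding to the same weight matrix, so E and U below share storage):

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

E = model.transformer.wte.weight     # (vocab_size, d_model): one row per token
U = model.lm_head.weight             # (vocab_size, d_model): unembedding, tied to E in GPT-2

vec = E[1000]                        # initial representation of token id 1000

activation = torch.randn(E.shape[1]) # placeholder final residual-stream vector
logits = U @ activation              # each row of U "votes" for its token
```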
Token embedding and unembedding matrices often share similar geometric structure:
E[i] · U[j]ᵀ measures how much token i "predicts" token j.
One of the most striking properties of neural language representations: you can do meaningful arithmetic with concept vectors.
If concepts are linear directions, then:
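The classic example is king - man + woman ≈ queen. A minimal sketch using pretrained GloVe word vectors through gensim (both are assumptions here, not part of the course materials; transformer token embeddings show the same effect, just more noisily):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # downloads pretrained GloVe vectors on first use

# king - man + woman -> nearest neighbors should include "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```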
Attention weights tell us which tokens each position is "looking at." Visualizing these patterns reveals interpretable structure.
For each head, create a matrix where A[i,j] = attention from position i to position j
Plot as a heatmap with tokens on both axes.
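A minimal plotting sketch with GPT-2 and matplotlib (the model, layer, and head choices are arbitrary assumptions; any model that returns attention weights works the same way):

```python
import torch
import matplotlib.pyplot as plt
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True)

text = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

layer, head = 5, 1
A = outputs.attentions[layer][0, head]              # (seq_len, seq_len): A[i, j] = attention from i to j
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

plt.imshow(A, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.colorbar(label="attention weight")
plt.show()
```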
Now we dive deep into induction heads, one of the first interpretable circuits discovered in transformers.
Pattern copying: if the model has seen [A][B] earlier, then upon seeing [A] again, it predicts [B].
Input: "foo bar baz foo bar baz foo"
At final "foo", model predicts: "bar" (copying the pattern)
Induction requires two attention heads working together across layers:
Step 1: Previous-Token Head (earlier layer) attends from each position to the immediately preceding token and copies that token's identity into the current position's representation.
Step 2: Induction Head (later layer) looks for an earlier position whose previous-token information matches the current token, attends to the token that followed it, and predicts that token as the next output.
Induction heads have a distinctive "stripe" pattern in attention visualization: an off-diagonal stripe, offset by the length of the repeated segment, where each token attends to the token that followed its previous occurrence.
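One way to hunt for such heads is to feed the model a repeated random-token sequence and score each head on how much attention it pays from each second-half token back to the position just after that token's first occurrence. A sketch under the assumptions of GPT-2 and this simple scoring heuristic:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True)

seq_len = 50
rand = torch.randint(0, 50257, (1, seq_len))
tokens = torch.cat([rand, rand], dim=1)                   # [A B C ... A B C ...]

with torch.no_grad():
    attentions = model(tokens).attentions                 # tuple of (1, n_heads, 2*seq_len, 2*seq_len)

# Induction score: attention from position i (second repeat) to position i - seq_len + 1
scores = {}
queries = torch.arange(seq_len + 1, 2 * seq_len)
for layer, attn in enumerate(attentions):
    for head in range(attn.shape[1]):
        A = attn[0, head]
        scores[(layer, head)] = A[queries, queries - seq_len + 1].mean().item()

top = sorted(scores.items(), key=lambda kv: -kv[1])[:5]
print("Candidate induction heads (layer, head) and scores:", top)
```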
You now have multiple tools for understanding representations:
These techniques combine to give you a comprehensive view of how language models represent and process information.
Building on our pun dataset from Week 3, we will visualize how the model represents puns versus non-puns and attempt to find a "pun direction" in activation space.
Code repository: davidbau.github.io/puns — Mechanisms of Pun Awareness in LLMs. Contains notebooks for exploring how models distinguish humorous vs. serious contexts, activation collection and analysis scripts, and visualization tools.
Using the provided notebook, load your pun dataset from Week 3 and extract activations:
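A minimal extraction sketch with Hugging Face transformers (the GPT-2 model, the layer index, and the last-token position are assumptions; swap in whatever the provided notebook uses, and replace the placeholder sentences with your Week 3 data):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

def get_activations(sentences, layer=6, position=-1):
    """Return a (n_sentences, d_model) tensor of hidden states at one layer and token position."""
    acts = []
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).hidden_states[layer]   # (1, seq_len, d_model)
        acts.append(hidden[0, position])                    # e.g. the last token's vector
    return torch.stack(acts)

# Placeholder sentences; replace with the pun / non-pun sentences from your Week 3 dataset
pun_sentences = ["I used to be a banker, but I lost interest."]
nonpun_sentences = ["I used to be a banker, but I changed careers."]

pun_acts = get_activations(pun_sentences)
nonpun_acts = get_activations(nonpun_sentences)
```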
Questions to consider: Should we extract from the punchline token specifically? What about the setup? Does position matter for pun representations?
Apply PCA to visualize whether puns and non-puns separate in activation space:
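A plotting sketch that compares several layers side by side (the layer choices and the random placeholder arrays are assumptions; substitute the per-layer activations you extracted above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

layers_to_plot = [2, 6, 10]
# Placeholders: replace with real per-layer activation arrays of shape (n_examples, d_model)
pun_acts = {l: np.random.randn(50, 768) for l in layers_to_plot}
nonpun_acts = {l: np.random.randn(50, 768) for l in layers_to_plot}

fig, axes = plt.subplots(1, len(layers_to_plot), figsize=(12, 4))
for ax, layer in zip(axes, layers_to_plot):
    X = np.vstack([pun_acts[layer], nonpun_acts[layer]])
    labels = np.array([1] * len(pun_acts[layer]) + [0] * len(nonpun_acts[layer]))
    proj = PCA(n_components=2).fit_transform(X)
    ax.scatter(proj[:, 0], proj[:, 1], c=labels, cmap="coolwarm", s=10)
    ax.set_title(f"Layer {layer}: puns (red) vs. non-puns (blue)")
plt.tight_layout()
plt.show()
```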
Discussion: At which layer do puns most clearly separate from non-puns? Is the separation clean or fuzzy? What might this tell us about how the model processes humor?
Compute a mass mean-difference vector to find the "pun direction":
mean_pun = average(activations for pun sentences)
mean_nonpun = average(activations for non-pun sentences)
pun_direction = mean_pun - mean_nonpun
Extension: Try the same analysis for different types of puns (wordplay vs. situation comedy). Do they have different directions?
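A sketch of the pun-direction computation with a simple train/test split, so you can check whether the direction generalizes to held-out sentences (the placeholder arrays and the 80/20 split are assumptions):

```python
import numpy as np

# Placeholders: replace with the pun / non-pun activations from one layer
pun_acts = np.random.randn(60, 768)
nonpun_acts = np.random.randn(60, 768)

def split(acts, frac=0.8):
    n = int(len(acts) * frac)
    return acts[:n], acts[n:]

pun_train, pun_test = split(pun_acts)
nonpun_train, nonpun_test = split(nonpun_acts)

pun_direction = pun_train.mean(axis=0) - nonpun_train.mean(axis=0)
threshold = 0.5 * (pun_train.mean(axis=0) + nonpun_train.mean(axis=0)) @ pun_direction

# Held-out check: puns should project above the threshold, non-puns below
correct = np.concatenate([
    pun_test @ pun_direction > threshold,
    nonpun_test @ pun_direction <= threshold,
])
print(f"Held-out accuracy along the pun direction: {correct.mean():.2f}")
```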
Open Pun Geometry Notebook in Colab
This week's exercise provides hands-on practice with visualization techniques:
Due: Thursday of Week 4
Now that you have selected your model and built a benchmark, dive deep into how the model represents your concept internally. Use visualization techniques to examine geometric structure across layers and token positions.
This analysis will guide your next steps: if you find a strong linear representation at specific layers/positions, that's where you'll focus causal interventions (Week 4) and probe training (Week 5).