If concepts are encoded as directions in activation space, can we control model behavior by intervening on those directions? This week explores how to read and write to the model's internal representations—not just observe them, but manipulate them at inference time. You'll learn how neural networks encode concepts as directions in high-dimensional space, understand superposition, and gain hands-on experience extracting representation vectors and using them to steer model behavior.
By the end of this week, you should be able to:
A feedforward neural network is a computational graph that transforms inputs to outputs through layers of simple operations. Each layer consists of a linear transformation (a weight matrix plus a bias) followed by a simple elementwise nonlinearity such as ReLU.
The magic is in the weights: billions of numbers that determine what the network computes. But where do these weights come from?
Neural networks are trained, not programmed. The process: start with random weights, measure how wrong the network's outputs are on training examples, and repeatedly adjust the weights (via gradient descent) to reduce that error.
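As a rough illustration, here is a minimal training loop on a toy regression problem (the example and numbers are made up; the same loop structure underlies language-model pretraining at vastly larger scale):

```python
import torch
import torch.nn as nn

# Toy data: learn y = 2x + 1 from noisy samples
x = torch.randn(256, 1)
y = 2 * x + 1 + 0.1 * torch.randn(256, 1)

# Start from random weights
model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1000):
    loss = ((model(x) - y) ** 2).mean()  # how wrong is the network right now?
    opt.zero_grad()
    loss.backward()                      # for every weight: which way to change it to reduce the loss
    opt.step()                           # nudge each weight slightly in that direction
```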
This process is remarkably effective, but creates a fundamental interpretability challenge: We never explicitly told the network what each neuron should do. The roles of individual neurons emerge from optimization, and they can be arbitrary, redundant, or polysemantic (having multiple meanings).
This is why interpretability is hard: the network works, but we can't explain how it works in terms of what individual components compute.
As a neural network processes input, it creates intermediate representations at each layer. An activation vector is the state of the network at a particular layer and position.
For a language model processing text, at each token position and each layer, there's a vector (typically 768-12,288 dimensions) representing what the network "knows" at that point:
Each activation vector is a high-dimensional point in activation space. The hypothesis of mechanistic interpretability is that these vectors encode meaningful information about the input in a structured way.
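For example, here is one way to grab an activation vector with Hugging Face `transformers` (GPT-2 small is used purely for illustration; the course notebook may use a different model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple with one tensor per layer (plus the embeddings),
# each of shape (batch, sequence_length, d_model)
layer, position = 6, -1
activation = out.hidden_states[layer][0, position]
print(activation.shape)  # torch.Size([768]) for GPT-2 small
```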
Unlike symbolic systems where concepts are discrete symbols, neural networks use distributed representations: each concept is represented by a pattern of activity across many neurons.
Key properties:
This is powerful for learning but challenging for interpretation: we can't point to "the democracy neuron."
A more specific claim that has proven surprisingly accurate: concepts are represented as directions in the high-dimensional activation space.
Formally: there exists a direction vector d such that the component of an activation vector a in direction d measures the presence/strength of a concept:
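In symbols (the notation here is illustrative), writing $\hat{\mathbf{d}} = \mathbf{d} / \lVert \mathbf{d} \rVert$ for the normalized direction:

$$\text{concept strength}(\mathbf{a}) \;\approx\; \mathbf{a} \cdot \hat{\mathbf{d}}$$

i.e., the scalar projection of the activation onto the concept direction.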
Why this matters:
Evidence: Linear probes work well, steering vectors are effective, truth directions exist across contexts, etc.
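To make the probing idea concrete, here is a minimal sketch on fake data (the activations and labels below are synthetic, purely to show the mechanics): fit a linear classifier on activation vectors labeled for a concept, and read off its weight vector as an estimate of the concept direction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fake setup: X holds "activation vectors", y says whether the input expresses the concept
rng = np.random.default_rng(0)
d = 768
true_direction = rng.standard_normal(d)      # pretend this is the concept direction
X = rng.standard_normal((500, d))
y = (X @ true_direction > 0).astype(int)     # labels determined by that direction

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))

# The probe's weight vector is an estimate of the concept direction
estimated_direction = probe.coef_[0]
cos = estimated_direction @ true_direction / (
    np.linalg.norm(estimated_direction) * np.linalg.norm(true_direction))
print("cosine with true direction:", round(cos, 3))
```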
Here's a puzzle: models have ~100,000 neurons per layer but need to represent millions of features (concepts, patterns, facts). How?
Superposition: Models represent more features than dimensions by storing them as quasi-orthogonal directions, accepting some interference between features that rarely co-occur.
With 100,000 dimensions and sparse features (most are zero for any given input), you can pack in millions of directions with limited interference.
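A quick numerical sanity check (illustrative): random directions in high-dimensional space are nearly orthogonal to one another, which is what makes this packing possible.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10_000, 1_000                              # dimensions, number of random directions
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)     # unit-normalize each direction

sims = V @ V.T                                    # pairwise cosine similarities
off_diag = np.abs(sims[~np.eye(n, dtype=bool)])
print(f"mean |cos| = {off_diag.mean():.4f}, max |cos| = {off_diag.max():.4f}")
# Both values come out tiny: the directions are nearly orthogonal
```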
Consequences:
Modern language models use the transformer architecture. Understanding its components is essential for mechanistic interpretability.
Token Encoder: Converts each token (word piece) to an initial vector.
Residual Stream: The "main highway" of information flow. Each layer reads from and writes to this stream.
Key insight: Each component adds information to the stream rather than replacing it. This enables information to flow across many layers.
Multi-Head Attention: Looks at previous tokens to gather context. "Multi-head" means parallel attention operations with different learned patterns.
MLP (Feed-Forward) Layers: After attention, each position independently processes its vector through a 2-layer feedforward network. Often interpreted as memory storage.
Token Decoder: The final layer converts the residual stream vector to probabilities over the vocabulary.
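The additive structure of the residual stream is easiest to see in code. Below is a schematic of one transformer block (a sketch, not any particular model's implementation; causal masking and other real-model details are omitted):

```python
import torch
import torch.nn as nn

class ToyTransformerBlock(nn.Module):
    """Schematic block: attention and the MLP each ADD their output to the residual stream."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):                  # x: (batch, seq, d_model), the residual stream
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                   # attention writes into the stream
        x = x + self.mlp(self.ln2(x))      # the MLP writes into the stream
        return x                           # the stream carries everything forward
```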
Since neurons are polysemantic due to superposition, we need tools to disentangle the features actually encoded in activation vectors. Sparse Autoencoders (SAEs) are one such tool.
Train a neural network to reconstruct activation vectors after forcing them through a much wider hidden layer in which only a few features are active at once (a sparse bottleneck).
The learned sparse features often correspond to interpretable concepts: a direction for "medical terms", another for "positive sentiment", etc.
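A minimal sketch of the idea (one common formulation; real SAEs add details such as normalized decoder columns and bias terms):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: expand activations into many (mostly zero) features, then reconstruct."""

    def __init__(self, d_model=768, n_features=16_384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        # After training, decoder.weight[:, i] is feature i's direction in activation space
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))   # non-negative, encouraged to be sparse
        recon = self.decoder(feats)
        return recon, feats

def sae_loss(recon, acts, feats, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero
    return ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
```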
Neuronpedia is a database of SAE features trained on various models. For this course:
You don't need to understand SAE training details—just know that they decompose activations into interpretable feature directions.
Now we put it all together. If concepts are linear directions, we can steer behavior by adding concept vectors to activations.
Alternatively, use SAE features from Neuronpedia as steering vectors directly.
During model inference, run the model normally up to a chosen layer, add the concept vector (scaled by a coefficient α) to the activation at that layer, and let the rest of the forward pass continue unchanged: a′ = a + α·d.
The coefficient α controls strength: positive amplifies the concept, negative suppresses it.
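A hedged sketch of the mechanics using a forward hook (GPT-2 via Hugging Face `transformers` is assumed here; the module path and output format vary by model and library version, and the course Colab handles these details for you):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

layer_idx, alpha = 6, 8.0
d_model = model.config.n_embd
steering_vector = torch.randn(d_model)        # stand-in: use a real concept / SAE feature direction
steering_vector = steering_vector / steering_vector.norm()

def add_steering(module, inputs, output):
    hidden = output[0]                        # residual stream: (batch, seq, d_model)
    hidden = hidden + alpha * steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:]             # returning a value replaces the block's output

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
prompt = tok("I walked into the room and", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=30, do_sample=True)
handle.remove()                               # always remove hooks when done
print(tok.decode(out[0], skip_special_tokens=True))
```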
The journey from neurons to steering:
This exercise introduces Neuronpedia as a tool for finding concept vectors, using puns and humor as our running example. We'll explore what features the model has learned about humor and test whether we can steer model behavior.
Go to Neuronpedia and search for features related to humor, jokes, and puns. Try searches like:
humor, joke, pun, funny, wordplay, comedy, laughter

For each interesting feature you find:
Discussion: Share your findings with the class. Did different people find different features? Are there multiple "types" of humor features (setup vs punchline, wordplay vs situation comedy, etc.)?
Pick your most promising humor feature and analyze it more deeply:
Record your observations—you'll use these features in the steering exercise.
Using the provided Colab notebook, apply your humor feature as a steering vector:
Questions to discuss:
This week's exercise provides hands-on experience with the core concepts:
Goal: Find representation vectors for your concept and demonstrate that you can steer model behavior with them.
Due: Before Week 3 class
Due: Thursday of Week 1
Select one concept from your candidate list (from Week 0) and demonstrate a steering proof-of-concept. The goal is to show that your concept exists in the model and can be controlled through activation interventions.
This is an exploratory phase—you're testing whether your concept is "steerable" and gathering initial evidence that the model has learned something meaningful about it. Don't worry about perfection; focus on demonstrating that there's something interesting to investigate further.
Tip: If steering doesn't work well for your first concept, try another from your candidate list. Finding a concept that responds well to steering will make the rest of the semester much more productive!