If concepts are encoded as directions in activation space, can we control model behavior by intervening on those directions? This week explores how to read and write to the model's internal representations—not just observe them, but manipulate them at inference time. You'll learn how neural networks encode concepts as directions in high-dimensional space, understand superposition, and gain hands-on experience extracting representation vectors and using them to steer model behavior.
By the end of this week, you should be able to:
A feedforward neural network is a computational graph that transforms inputs to outputs through layers of simple operations. Each layer consists of a linear transformation (a weight matrix plus a bias) followed by a simple elementwise nonlinearity such as ReLU.
The magic is in the weights: billions of numbers that determine what the network computes. But where do these weights come from?
Neural networks are trained, not programmed. The process: start with random weights, measure how wrong the network's outputs are on training examples, and repeatedly adjust the weights (via gradient descent) to reduce that error.
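As a rough illustration, here is a minimal training loop on a toy regression problem (the example and numbers are made up; the same loop structure underlies language-model pretraining at vastly larger scale):

```python
import torch
import torch.nn as nn

# Toy data: learn y = 2x + 1 from noisy samples
x = torch.randn(256, 1)
y = 2 * x + 1 + 0.1 * torch.randn(256, 1)

# Start from random weights
model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1000):
    loss = ((model(x) - y) ** 2).mean()  # how wrong is the network right now?
    opt.zero_grad()
    loss.backward()                      # for every weight: which way to change it to reduce the loss
    opt.step()                           # nudge each weight slightly in that direction
```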
This process is remarkably effective, but creates a fundamental interpretability challenge: We never explicitly told the network what each neuron should do. The roles of individual neurons emerge from optimization, and they can be arbitrary, redundant, or polysemantic (having multiple meanings).
This is why interpretability is hard: the network works, but we can't explain how it works in terms of what individual components compute.
As a neural network processes input, it creates intermediate representations at each layer. An activation vector is the state of the network at a particular layer and position.
For a language model processing text, at each token position and each layer, there's a vector (typically 768-12,288 dimensions) representing what the network "knows" at that point:
Each activation vector is a high-dimensional point in activation space. The hypothesis of mechanistic interpretability is that these vectors encode meaningful information about the input in a structured way.
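For example, here is one way to grab an activation vector with Hugging Face `transformers` (GPT-2 small is used purely for illustration; the course notebook may use a different model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple with one tensor per layer (plus the embeddings),
# each of shape (batch, sequence_length, d_model)
layer, position = 6, -1
activation = out.hidden_states[layer][0, position]
print(activation.shape)  # torch.Size([768]) for GPT-2 small
```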
Unlike symbolic systems where concepts are discrete symbols, neural networks use distributed representations: each concept is represented by a pattern of activity across many neurons.
Key properties:
This is powerful for learning but challenging for interpretation: we can't point to "the democracy neuron."
A more specific claim that has proven surprisingly accurate: concepts are represented as directions in the high-dimensional activation space.
Formally: there exists a direction vector d such that the component of an activation vector a in direction d measures the presence/strength of a concept:
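In symbols (the notation here is illustrative), writing $\hat{\mathbf{d}} = \mathbf{d} / \lVert \mathbf{d} \rVert$ for the normalized direction:

$$\text{concept strength}(\mathbf{a}) \;\approx\; \mathbf{a} \cdot \hat{\mathbf{d}}$$

i.e., the scalar projection of the activation onto the concept direction.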
Why this matters:
Evidence: Linear probes work well, steering vectors are effective, truth directions exist across contexts, etc.
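To make the probing idea concrete, here is a minimal sketch on fake data (the activations and labels below are synthetic, purely to show the mechanics): fit a linear classifier on activation vectors labeled for a concept, and read off its weight vector as an estimate of the concept direction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fake setup: X holds "activation vectors", y says whether the input expresses the concept
rng = np.random.default_rng(0)
d = 768
true_direction = rng.standard_normal(d)      # pretend this is the concept direction
X = rng.standard_normal((500, d))
y = (X @ true_direction > 0).astype(int)     # labels determined by that direction

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))

# The probe's weight vector is an estimate of the concept direction
estimated_direction = probe.coef_[0]
cos = estimated_direction @ true_direction / (
    np.linalg.norm(estimated_direction) * np.linalg.norm(true_direction))
print("cosine with true direction:", round(cos, 3))
```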
Here's a puzzle: models have ~100,000 neurons per layer but need to represent millions of features (concepts, patterns, facts). How?
Superposition: Models represent more features than dimensions by storing them as quasi-orthogonal directions, accepting some interference between features that rarely co-occur.
With 100,000 dimensions and sparse features (most are zero for any given input), you can pack in millions of directions with limited interference.
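A quick numerical sanity check (illustrative): random directions in high-dimensional space are nearly orthogonal to one another, which is what makes this packing possible.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10_000, 1_000                              # dimensions, number of random directions
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)     # unit-normalize each direction

sims = V @ V.T                                    # pairwise cosine similarities
off_diag = np.abs(sims[~np.eye(n, dtype=bool)])
print(f"mean |cos| = {off_diag.mean():.4f}, max |cos| = {off_diag.max():.4f}")
# Both values come out tiny: the directions are nearly orthogonal
```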
Consequences:
Modern language models use the transformer architecture. Understanding its components is essential for mechanistic interpretability.
Token Encoder: Converts each token (word piece) to an initial vector.
Residual Stream: The "main highway" of information flow. Each layer reads from and writes to this stream.
Key insight: Each component adds information to the stream rather than replacing it. This enables information to flow across many layers.
Multi-Head Attention: Looks at previous tokens to gather context. "Multi-head" means parallel attention operations with different learned patterns.
MLP (Feed-Forward) Layers: After attention, each position independently processes its vector through a 2-layer feedforward network. Often interpreted as memory storage.
Token Decoder: The final layer converts the residual stream vector to probabilities over the vocabulary.
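The additive structure of the residual stream is easiest to see in code. Below is a schematic of one transformer block (a sketch, not any particular model's implementation; causal masking and other real-model details are omitted):

```python
import torch
import torch.nn as nn

class ToyTransformerBlock(nn.Module):
    """Schematic block: attention and the MLP each ADD their output to the residual stream."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):                  # x: (batch, seq, d_model), the residual stream
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                   # attention writes into the stream
        x = x + self.mlp(self.ln2(x))      # the MLP writes into the stream
        return x                           # the stream carries everything forward
```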
Since neurons are polysemantic due to superposition, we need tools to disentangle the features actually encoded in activation vectors. Sparse Autoencoders (SAEs) are one such tool.
Train a neural network to reconstruct activation vectors after forcing them through a much wider hidden layer in which only a few features are active at once (a sparse bottleneck).
The learned sparse features often correspond to interpretable concepts: a direction for "medical terms", another for "positive sentiment", etc.
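A minimal sketch of the idea (one common formulation; real SAEs add details such as normalized decoder columns and bias terms):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: expand activations into many (mostly zero) features, then reconstruct."""

    def __init__(self, d_model=768, n_features=16_384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        # After training, decoder.weight[:, i] is feature i's direction in activation space
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))   # non-negative, encouraged to be sparse
        recon = self.decoder(feats)
        return recon, feats

def sae_loss(recon, acts, feats, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero
    return ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
```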
Neuronpedia is a database of SAE features trained on various models. For this course:
You don't need to understand SAE training details—just know that they decompose activations into interpretable feature directions.
Now we put it all together. If concepts are linear directions, we can steer behavior by adding concept vectors to activations.
Alternatively, use SAE features from Neuronpedia as steering vectors directly.
During model inference, run the model normally up to a chosen layer, add the concept vector (scaled by a coefficient α) to the activation at that layer, and let the rest of the forward pass continue unchanged: a′ = a + α·d.
The coefficient α controls strength: positive amplifies the concept, negative suppresses it.
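A hedged sketch of the mechanics using a forward hook (GPT-2 via Hugging Face `transformers` is assumed here; the module path and output format vary by model and library version, and the course Colab handles these details for you):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

layer_idx, alpha = 6, 8.0
d_model = model.config.n_embd
steering_vector = torch.randn(d_model)        # stand-in: use a real concept / SAE feature direction
steering_vector = steering_vector / steering_vector.norm()

def add_steering(module, inputs, output):
    hidden = output[0]                        # residual stream: (batch, seq, d_model)
    hidden = hidden + alpha * steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:]             # returning a value replaces the block's output

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
prompt = tok("I walked into the room and", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=30, do_sample=True)
handle.remove()                               # always remove hooks when done
print(tok.decode(out[0], skip_special_tokens=True))
```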
The journey from neurons to steering:
This exercise introduces Neuronpedia as a tool for finding concept vectors, using puns and humor as our running example. We'll explore what features the model has learned about humor and test whether we can steer model behavior.
Go to Neuronpedia and search for features related to humor, jokes, and puns. Try searches like:
humor, joke, pun, funny, wordplay, comedy, laughter

For each interesting feature you find:
Discussion: Share your findings with the class. Did different people find different features? Are there multiple "types" of humor features (setup vs punchline, wordplay vs situation comedy, etc.)?
Pick your most promising humor feature and analyze it more deeply:
Record your observations—you'll use these features in the steering exercise.
Using the provided Colab notebook, apply your humor feature as a steering vector:
Questions to discuss:
This week's exercise provides hands-on experience with the core concepts:
Goal: Find representation vectors for your concept and demonstrate that you can steer model behavior with them.
Due: Before Week 3 class
Due: Thursday of Week 1
Select one concept from your candidate list (from Week 0) and demonstrate a steering proof-of-concept. The goal is to show that your concept exists in the model and can be controlled through activation interventions.
This is an exploratory phase—you're testing whether your concept is "steerable" and gathering initial evidence that the model has learned something meaningful about it. Don't worry about perfection; focus on demonstrating that there's something interesting to investigate further.
Tip: If steering doesn't work well for your first concept, try another from your candidate list. Finding a concept that responds well to steering will make the rest of the semester much more productive!