
Week 2: Steering

Overview

If concepts are encoded as directions in activation space, can we control model behavior by intervening on those directions? This week explores how to read and write to the model's internal representations—not just observe them, but manipulate them at inference time. You'll learn how neural networks encode concepts as directions in high-dimensional space, understand superposition, and gain hands-on experience extracting representation vectors and using them to steer model behavior.

Learning Objectives

By the end of this week, you should be able to:

Explain how neural networks encode concepts as directions in activation space
Describe superposition and why individual neurons are often polysemantic
Extract representation (steering) vectors from contrastive examples and from SAE features
Apply steering vectors at inference time to amplify or suppress a concept in model outputs

Readings

Core Readings

Piantadosi (2024). Modern argument that vector representations are the natural format for concepts, connecting to the distributed representation tradition
Elhage et al. (2022). Foundational paper on superposition, distributed representations, and why features do not align with neurons
Li et al. (2023). Extracting and applying steering vectors to control model behavior

Supplementary Readings

Zou et al. (2023). Framework for finding and manipulating high-level concept representations
Modern approach to decomposing superposition using SAEs
Bricken et al. (2023). Anthropic's work on finding interpretable features with sparse dictionaries
Marks & Tegmark (2023). Evidence for linear representation of concepts across diverse tasks

Optional Readings: The Binding Problem

Fodor & Pylyshyn (1988). Classic critique arguing that connectionist networks cannot represent compositional structure—how do you bind "John loves Mary" differently from "Mary loves John" using only distributed representations?
Smolensky (1990). Connectionist response to Fodor: shows how tensor products can implement variable binding in distributed representations, enabling compositional structure
Gayler (2003). Shows how high-dimensional vector operations (similar to what modern LLMs use) can represent compositional linguistic structures

Tools & Resources

Neuronpedia: Interactive tool for exploring SAE features and finding representation vectors

Tutorial: From Neurons to Steering

1. Feedforward Neural Networks: The Foundation

A feedforward neural network is a computational graph that transforms inputs to outputs through layers of simple operations. Each layer applies a linear transformation (multiply by a weight matrix, add a bias) followed by an elementwise nonlinearity:

Input → [Linear + Nonlinear] → [Linear + Nonlinear] → ... → Output
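As a concrete illustration, here is a minimal sketch of this structure in PyTorch; the layer sizes and the input are arbitrary placeholders, not taken from any real model.

import torch
import torch.nn as nn

# A minimal feedforward network: a stack of [Linear + Nonlinear] layers.
model = nn.Sequential(
    nn.Linear(16, 32),   # linear: multiply by a weight matrix, add a bias
    nn.ReLU(),           # nonlinear: elementwise max(0, x)
    nn.Linear(32, 8),
    nn.ReLU(),
    nn.Linear(8, 4),     # output layer
)

x = torch.randn(1, 16)   # a single 16-dimensional input
y = model(x)             # Input → [Linear + Nonlinear] → ... → Output
print(y.shape)           # torch.Size([1, 4])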

The magic is in the weights: billions of numbers that determine what the network computes. But where do these weights come from?

2. Training: Optimization Without Explicit Design

Neural networks are trained, not programmed. The process (a code sketch follows the list):

  1. Initialize weights randomly
  2. Show the network many examples (e.g., predict next tokens)
  3. Measure error between predictions and actual answers
  4. Use gradient descent to adjust weights to reduce error
  5. Repeat millions of times
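A toy, runnable version of this loop, assuming PyTorch; the data here are random placeholders rather than real next-token examples.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))  # 1. random init
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

inputs = torch.randn(256, 10)             # 2. many examples (placeholders here)
targets = torch.randn(256, 1)

for step in range(1000):                  # 5. repeat many times
    predictions = model(inputs)
    loss = loss_fn(predictions, targets)  # 3. measure error
    optimizer.zero_grad()
    loss.backward()                       # 4. gradient descent adjusts the weights
    optimizer.step()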

This process is remarkably effective, but creates a fundamental interpretability challenge: We never explicitly told the network what each neuron should do. The roles of individual neurons emerge from optimization, and they can be arbitrary, redundant, or polysemantic (having multiple meanings).

This is why interpretability is hard: the network works, but we don't know how it works in terms of what individual components compute.

3. Activation Vectors: Internal Computational States

As a neural network processes input, it creates intermediate representations at each layer. An activation vector is the state of the network at a particular layer and position.

For a language model processing text, at each token position and each layer, there's a vector (typically 768-12,288 dimensions) representing what the network "knows" at that point:

Input: "The cat sat on the"

Layer 0: [v₀₀, v₀₁, v₀₂, v₀₃, v₀₄] ← activation vectors for each token
Layer 1: [v₁₀, v₁₁, v₁₂, v₁₃, v₁₄]
...
Layer 12: [v₁₂,₀, v₁₂,₁, v₁₂,₂, v₁₂,₃, v₁₂,₄]

Output probabilities for next token

Each activation vector is a high-dimensional point in activation space. The hypothesis of mechanistic interpretability is that these vectors encode meaningful information about the input in a structured way.
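A sketch of how you might inspect these vectors with the Hugging Face transformers library and GPT-2 small; any model that exposes hidden states would work similarly.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[1..12] are the layer outputs.
for layer, h in enumerate(outputs.hidden_states):
    print(f"layer {layer}: {tuple(h.shape)}")   # (1 batch, 5 tokens, 768 dimensions)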

4. The Distributed Representation Hypothesis

Unlike symbolic systems where concepts are discrete symbols, neural networks use distributed representations: each concept is represented by a pattern of activity across many neurons.

Key properties:

Many neurons participate in representing each concept
Each neuron participates in representing many concepts
Similar concepts are represented by similar patterns of activity

This is powerful for learning but challenging for interpretation: we can't point to "the democracy neuron."

5. The Linear Representation Hypothesis

A more specific claim that has proven surprisingly accurate: concepts are represented as directions in the high-dimensional activation space.

Formally: there exists a direction vector d such that the component of an activation vector a in direction d measures the presence/strength of a concept:

concept_strength = a · d (dot product)

Why this matters:

If concepts are directions, you can read them with a simple dot product (linear probes)
You can also write them: adding a concept direction to an activation should change behavior (steering)

Evidence: Linear probes work well, steering vectors are effective, truth directions exist across contexts, etc.
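A toy sketch of reading a concept with a dot product. The "activations" below are fabricated so that a known direction is either present or absent; with a real model you would use extracted activation vectors instead.

import numpy as np

rng = np.random.default_rng(0)
d_true = rng.normal(size=768)
d_true /= np.linalg.norm(d_true)                   # ground-truth concept direction

with_concept = rng.normal(size=(100, 768)) + 3.0 * d_true   # concept present
without_concept = rng.normal(size=(100, 768))               # concept absent

# Estimate the direction as a difference of means, then read it with a dot product.
d_hat = with_concept.mean(axis=0) - without_concept.mean(axis=0)
d_hat /= np.linalg.norm(d_hat)

print("with:   ", (with_concept @ d_hat).mean())    # ≈ 3: concept strength is high
print("without:", (without_concept @ d_hat).mean()) # ≈ 0: concept strength is near zero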

6. The Superposition Hypothesis

Here's a puzzle: models have tens of thousands of neurons per layer but need to represent millions of features (concepts, patterns, facts). How?

Superposition: Models represent more features than dimensions by storing them as quasi-orthogonal directions, accepting some interference between features that rarely co-occur.

Simple Example: 2D Space, 3 Features

Feature A: direction (1, 0)
Feature B: direction (0, 1)
Feature C: direction (1/√2, 1/√2)

C is not orthogonal to A or B (its cosine similarity with each is about 0.71), but if features are sparse (rarely active at the same time), the interference is tolerable.

With tens of thousands of dimensions and sparse features (most are zero for any given input), you can pack in millions of quasi-orthogonal directions with limited interference.
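A quick numerical sanity check of this intuition, in plain NumPy; the dimension and feature counts are arbitrary.

import numpy as np

# Random unit vectors in high dimensions are nearly orthogonal, so far more
# feature directions than dimensions can coexist with small interference.
rng = np.random.default_rng(0)
n_dims, n_features = 1000, 20000
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

sample = directions[:500]                 # check a subsample of pairs cheaply
cosines = np.abs(sample @ sample.T)
np.fill_diagonal(cosines, 0.0)
print(cosines.max())                      # around 0.15: small pairwise interference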

Consequences:

Individual neurons become polysemantic, responding to several unrelated features
Features do not align with the neuron basis, so reading off single neurons is misleading
We need tools such as sparse autoencoders (Section 8) to disentangle the packed features

7. Transformer Architecture: The Components

Modern language models use the transformer architecture. Understanding its components is essential for mechanistic interpretability.

Core Components

Token Encoder: Converts each token (word piece) to an initial vector

"cat" → [0.2, -0.5, 0.8, ..., 0.1] (768-dim vector)

Residual Stream: The "main highway" of information flow. Each layer reads from and writes to this stream.

x₀ = embed(token)
x₁ = x₀ + attention₁(x₀) + MLP₁(x₀)
x₂ = x₁ + attention₂(x₁) + MLP₂(x₁)
...
xₙ = residual stream at layer n

Key insight: Each component adds information to the stream rather than replacing it. This enables information to flow across many layers.

Multihead Attention: Looks at previous tokens to gather context. "Multi-head" means parallel attention operations with different learned patterns.

MLP (Feed-Forward) Layers: After attention, each position independently processes its vector through a 2-layer feedforward network. Often interpreted as memory storage.
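A simplified sketch of one block writing into the residual stream, mirroring the equations above. This is not a faithful GPT-2 block: it omits layer normalization and the causal attention mask, and it applies the MLP to the same input as attention for simplicity.

import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):                      # x: residual stream (batch, tokens, d_model)
        attn_out, _ = self.attention(x, x, x)  # read context from the stream
        return x + attn_out + self.mlp(x)      # write by adding, not replacing

x = torch.randn(1, 5, 768)                     # 5 tokens, 768-dimensional stream
print(Block()(x).shape)                        # torch.Size([1, 5, 768])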

Token Decoder: The final layer converts the residual stream vector to probabilities over the vocabulary

x_final → linear → logits → softmax → P(next token)
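With GPT-2 small via Hugging Face transformers, this decoder step looks like the following sketch, continuing the running example.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # (batch, tokens, vocab_size)

probs = torch.softmax(logits[0, -1], dim=-1)     # P(next token | "The cat sat on the")
top = torch.topk(probs, 5)                       # the five most likely next tokens
print([(tokenizer.decode(i), round(p.item(), 3)) for i, p in zip(top.indices, top.values)])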

8. Sparse Autoencoders: Finding Features in Superposition

Since neurons are polysemantic due to superposition, we need tools to disentangle the features actually encoded in activation vectors. Sparse Autoencoders (SAEs) are one such tool.

Basic Idea

Train a neural network to:

  1. Encode: Map activation vector (e.g., 768-dim) to larger sparse code (e.g., 16,384-dim) where most elements are zero
  2. Decode: Reconstruct the original activation from the sparse code
  3. Sparsity: Encourage only a few features to be active at once

activation (768) → encoder → sparse features (16,384) → decoder → reconstruction (768)

The learned sparse features often correspond to interpretable concepts: a direction for "medical terms", another for "positive sentiment", etc.
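A minimal sketch of this structure. It is not a production SAE; real ones (for example, those trained with SAELens) add details such as decoder weight normalization and careful sparsity schedules.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activation):
        features = torch.relu(self.encoder(activation))   # sparse, nonnegative code
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
activation = torch.randn(1, 768)
features, reconstruction = sae(activation)

# Training objective (sketch): reconstruction error plus an L1 sparsity penalty.
loss = ((reconstruction - activation) ** 2).mean() + 1e-3 * features.abs().mean()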

Using Neuronpedia

Neuronpedia is a database of SAE features trained on various models. For this course, you don't need to understand SAE training details—just know that SAEs decompose activations into interpretable feature directions that you can browse and search.

9. Steering: Causal Intervention with Representation Vectors

Now we put it all together. If concepts are linear directions, we can steer behavior by adding concept vectors to activations.

Finding a Steering Vector

  1. Contrastive examples: Create pairs where one has the concept, one doesn't
    • "Tell me how to bake cookies" vs "Refuse to tell me how to bake cookies"
  2. Extract activations: Run both through the model, record activation vectors at each layer
  3. Take difference: v_concept = mean(activations_with) - mean(activations_without)
  4. Result: A direction that points toward the concept

Alternatively, use SAE features from Neuronpedia as steering vectors directly.
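A sketch of steps 1-3 using GPT-2 small and Hugging Face transformers; the prompts, the layer index, and the choice to read the last token position are illustrative placeholders.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 6                                         # which layer to read (placeholder)

def mean_activation(prompts):
    vectors = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs, output_hidden_states=True).hidden_states[LAYER]
        vectors.append(hidden[0, -1])             # activation at the last token position
    return torch.stack(vectors).mean(dim=0)

with_concept = ["Tell me how to bake cookies", "Please explain how to bake bread"]
without_concept = ["Refuse to tell me how to bake cookies", "Do not explain how to bake bread"]

v_concept = mean_activation(with_concept) - mean_activation(without_concept)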

Applying a Steering Vector

During model inference:

  1. Run the model normally on your input
  2. At target layer(s), intercept the activation vectors
  3. Add the steering vector: a_new = a_original + α * v_steering
  4. Continue processing with modified activations

The coefficient α controls strength: positive amplifies the concept, negative suppresses it.
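A sketch of this intervention using a plain PyTorch forward hook on one GPT-2 block (the course notebooks use NNSight, but the idea is the same). The layer, α, and the random placeholder vector are assumptions; in practice you would plug in the v_concept extracted above.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER, ALPHA = 6, 2.0
v_concept = torch.randn(768)          # placeholder: use the extracted steering vector

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * v_concept          # a_new = a_original + α * v_steering
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    prompt = tokenizer("Tell me about time", return_tensors="pt")
    out = model.generate(**prompt, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(out[0]))
finally:
    handle.remove()                   # remove the hook so later runs are unsteered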

What You Can Steer

Steering has been demonstrated for style and tone (e.g., humor, sentiment), topics and domains (e.g., medical or legal text), and behaviors such as refusal and truthfulness. These are the same kinds of concepts you will work with in the exercise and project below.

Putting It All Together

The journey from neurons to steering:

  1. Networks process information through layers of neurons
  2. Training discovers effective weights, but neuron roles aren't explicit
  3. Activation vectors are intermediate computational states
  4. Concepts are represented as distributed patterns (distributed hypothesis)
  5. Specifically, concepts are directions in activation space (linear hypothesis)
  6. Multiple concepts pack into fewer dimensions via superposition
  7. Transformers process sequences through attention and MLPs, modifying the residual stream
  8. SAEs help us find interpretable feature directions despite superposition
  9. We can extract and apply these directions to steer model behavior

In-Class Exercise: Finding and Steering with Humor Features

This exercise introduces Neuronpedia as a tool for finding concept vectors, using puns and humor as our running example. We'll explore what features the model has learned about humor and test whether we can steer model behavior.

Part 1: Neuronpedia Exploration (20 min)

Go to Neuronpedia and search for features related to humor, jokes, and puns. Try searches like:

"humor"
"joke" or "punchline"
"pun" or "wordplay"

For each interesting feature you find:

  1. Note the feature ID and which model/layer it comes from
  2. Examine the top activating examples—what patterns do you see?
  3. Is this a "clean" humor feature or does it also activate for other concepts?
  4. Look at negative examples—what doesn't activate this feature?

Discussion: Share your findings with the class. Did different people find different features? Are there multiple "types" of humor features (setup vs punchline, wordplay vs situation comedy, etc.)?

Part 2: Feature Analysis (15 min)

Pick your most promising humor feature and analyze it more deeply:

  1. Activation patterns: At what positions in sentences does it activate? Beginning, middle, or punchline?
  2. Polysemanticity: Does this feature respond to non-humor contexts? What else activates it?
  3. Compare layers: Search for similar features in earlier and later layers. How do they differ?

Record your observations—you'll use these features in the steering exercise.

Part 3: Steering with Humor Features (25 min)

Using the provided Colab notebook, apply your humor feature as a steering vector:

  1. Load the feature vector from Neuronpedia for your chosen feature
  2. Test prompts: Try steering on prompts like:
    • "Tell me about time"
    • "What's a banana?"
    • "Explain why the chicken crossed the road"
  3. Vary the coefficient: Try α = 0.5, 1.0, 2.0, 5.0. What happens at different strengths?
  4. Negative steering: Try α = -1.0. Does it make outputs less funny?

Questions to discuss:

Did steering make the outputs funnier, or just stranger?
At what value of α did the outputs stop being coherent?
Did negative steering actually reduce humor, or just change the style?
Did the feature behave the way its top activating examples on Neuronpedia suggested?

Open Neuronpedia Explorer in Colab
Open SAE Steering via NDIF in Colab

Code Exercise

This week's exercise provides hands-on experience with the core concepts:

Steering Tutorial (NNSight + SAELens)
Explore SAE Features (Neuronpedia)
Steer with SAE Features (NDIF)

Project Assignment

Extract and Apply Steering Vectors for Your Concept

Goal: Find representation vectors for your concept and demonstrate that you can steer model behavior with them.

Requirements:

  1. Contrastive Dataset: Create 10-20 pairs of prompts where one exhibits your concept and one doesn't
    • Example for "legal reasoning": "Apply legal precedent to..." vs "Ignore legal precedent..."
    • Pairs should differ primarily in your target concept
    • Include diverse phrasings and contexts
  2. Steering Vector Extraction:
    • Extract activations from multiple layers (try early, middle, and late layers)
    • Compute mean difference vectors
    • Analyze which layers produce the strongest concept representations
  3. Steering Experiments: Demonstrate steering on at least 5 test examples
    • Show original model output (no steering)
    • Show output with positive steering (amplify concept)
    • Show output with negative steering (suppress concept)
    • Vary steering strength (α) and report effects
  4. Neuronpedia Exploration:
    • Search Neuronpedia for features related to your concept
    • Identify 3-5 relevant SAE features
    • Compare SAE feature steering with your contrastive steering vectors
    • Which works better? Why might that be?
  5. Analysis: Address these questions in your writeup:
    • Which layer(s) encode your concept most strongly?
    • How does steering strength affect output quality vs concept presence?
    • Are there unintended side effects of steering?
    • Do the SAE features match your intuitions about the concept?
    • How would you evaluate whether steering is "successful"?

Deliverables:

Due: Before Week 3 class

Project Milestone

Due: Thursday of Week 1

Select one concept from your candidate list (from Week 0) and demonstrate a steering proof-of-concept. The goal is to show that your concept exists in the model and can be controlled through activation interventions.

This is an exploratory phase—you're testing whether your concept is "steerable" and gathering initial evidence that the model has learned something meaningful about it. Don't worry about perfection; focus on demonstrating that there's something interesting to investigate further.

Deliverables:

Tip: If steering doesn't work well for your first concept, try another from your candidate list. Finding a concept that responds well to steering will make the rest of the semester much more productive!