
Week 6: Probes

Overview

Previous weeks focused on direct intervention: patching activations, tracing circuits, measuring causal effects. This week introduces auxiliary models—small models trained to help us understand larger ones. Probes extract specific information from representations, while masks identify which components matter. These methods are faster and differentiable, but require careful interpretation to avoid spurious conclusions.

Learning Objectives

By the end of this week, you should be able to:

Train and evaluate linear and MLP probes on model activations
Validate probes with control tasks (selectivity, random labels, and layer progression)
Explain why high probe accuracy shows that information is present, not that the model uses it
Train learned masks with sparsity regularization to identify important components
Combine probes, masks, and causal interventions into a single research workflow

Required Readings

Kim et al. (2018). Foundational probing method that asks "how sensitive is the model to this human-defined concept?"
Hewitt & Liang (2019). Critical methodology: how to validate probes and ensure they reveal genuine representations.

Supplementary Readings

Tenney et al. (2019). The "edge probing" methodology for systematically studying encoded linguistic structures.

Tutorial: Auxiliary Models for Interpretability

1. Why Use Auxiliary Models?

In previous weeks, we used direct intervention: patching activations, ablating components, tracing circuits. These methods are powerful but have limitations: each component must be tested separately (expensive), the interventions are not differentiable, and they give only binary on/off answers.

Auxiliary models offer complementary benefits: they are fast, differentiable, and can search over many components or directions automatically. The price is a weaker causal claim, which is why their findings need validation.

Two Main Types

Probes: Small models that read information from representations

Hidden State → [Probe] → Predicted Concept

Masks: Learned parameters that identify important components

Component × [Mask Weight] → Masked Component → Output

Key Question: What Do They Tell Us?

Critical distinction: Probes and masks show what could be done with representations, not necessarily what the model actually does. Always validate with causal interventions.

2. Linear Probes: Reading Out Information

A linear probe is the simplest auxiliary model: a linear classifier trained to extract specific information from hidden states.

The Setup

Given hidden states h at some layer, train a linear classifier to predict concept y:

ŷ = Wh + b

Where W is a weight matrix and b is a bias vector.

Training Procedure

  1. Extract representations: Run model on labeled data, save hidden states
  2. Freeze main model: Don't update the main model's weights
  3. Train classifier: Optimize W and b to predict labels
  4. Evaluate: Test on held-out data

What High Probe Accuracy Means

If a linear probe achieves high accuracy, it tells us that the concept is linearly decodable from that layer's hidden states: the information is present and accessible. It does not tell us that the model actually uses this information (see Section 6).

Example: Probing for Sentiment

Text: "This movie was terrible"
Hidden state at layer 8: [h₁, h₂, ..., h₇₆₈]
Linear probe: ŷ = W·h + b
Prediction: Negative sentiment (98% confidence)

Interpretation: Layer 8 contains linearly accessible sentiment information. But does the model actually use this information? Need to verify with interventions.
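
A minimal sketch of this recipe in Python, assuming the layer-8 hidden states have already been extracted and saved; the file names below are placeholders, not part of any particular library.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Hypothetical files: one hidden state per example (shape [n, d_model])
    # and one binary concept label per example (e.g., 1 = negative sentiment).
    X = np.load("hidden_states_layer8.npy")
    y = np.load("sentiment_labels.npy")

    # Hold out a test set so accuracy reflects generalization, not memorization.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # The probe itself: a regularized linear classifier, y_hat = Wh + b.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)

    print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
    # probe.coef_[0] is the learned direction W; remember that high accuracy only
    # shows the information is accessible, not that the model uses it.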

3. Nonlinear Probes: MLP Probes

Sometimes information isn't linearly accessible. MLP probes (multi-layer perceptrons) can extract nonlinear patterns.

Architecture

ŷ = W₂ · ReLU(W₁h + b₁) + b₂

This adds a hidden layer with nonlinear activation, allowing more complex transformations.
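
A minimal PyTorch sketch of this architecture; the hidden width and class count are illustrative choices, not recommendations.

    import torch
    import torch.nn as nn

    class MLPProbe(nn.Module):
        def __init__(self, d_model: int, hidden: int = 256, n_classes: int = 2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d_model, hidden),   # W1 h + b1
                nn.ReLU(),                    # nonlinearity
                nn.Linear(hidden, n_classes), # W2 (.) + b2
            )

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            return self.net(h)

    probe = MLPProbe(d_model=768)
    h = torch.randn(32, 768)             # a batch of 32 hidden states
    labels = torch.randint(0, 2, (32,))  # dummy concept labels
    loss = nn.functional.cross_entropy(probe(h), labels)
    loss.backward()                      # train with any standard optimizer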

The Interpretability Tradeoff

Aspect | Linear Probe | MLP Probe
Expressiveness | Limited (linear only) | High (nonlinear patterns)
Interpretability | Clear (linear direction) | Opaque (nonlinear transformation)
What it measures | Linearly accessible info | Computationally accessible info
Overfitting risk | Lower | Higher

When to Use Each

Start with a linear probe: it is more interpretable and less prone to overfitting. Try an MLP probe when you suspect the information is present but not linearly accessible, and report both results so the gap is visible.

Warning: High MLP probe accuracy with low linear probe accuracy suggests information is present but not in a simple, interpretable form. The model may not use it the way your MLP probe does.

4. Probe Training: Methodology and Best Practices

Dataset Construction

1. Balanced classes: include roughly equal numbers of positive and negative examples so that chance accuracy is a meaningful baseline.

2. Minimal confounds: positive and negative examples should differ only in the target concept, not in length, topic, frequency, or syntactic structure.

3. Diverse examples: vary surface form and domain so the probe cannot succeed by memorizing a handful of templates.

Train/Test Splits

Always report accuracy on held-out data; with small datasets, prefer cross-validation over a single split.

Training Hyperparameters

Linear probes have few hyperparameters (mainly the regularization strength); tune them on a validation split, never on the test set.

5. Control Tasks: Validating Probe Behavior

Control tasks verify that probes actually measure what you think they measure, not spurious correlations.

Selectivity Test

Question: Does the probe specifically extract your target concept, or does it respond to other properties?

Method:

  1. Train probe on your concept (e.g., "is sentence past tense?")
  2. Create control dataset with different concept (e.g., "is sentence negative?")
  3. Test probe on control dataset
  4. Probe should have ~random accuracy on control task

If probe succeeds on control: it is not specifically measuring your concept; it is picking up a confound.

Random Label Test

Question: Is the probe overfitting to training data?

Method:

  1. Randomize labels in training set
  2. Train probe on random labels
  3. If probe achieves high accuracy, it's overfitting
  4. Real patterns should not be learnable from random labels (this check and the selectivity test are sketched below)
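
A hedged sketch of both checks, reusing a scikit-learn probe; the arrays for the target concept (X_train, y_train, X_test, y_test) and for the control concept (X_control, y_control) are assumed to exist already.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Selectivity: a concept-specific probe should be near chance when scored
    # against a different concept's labels.
    control_acc = accuracy_score(y_control, probe.predict(X_control))
    print("accuracy on control concept:", control_acc)

    # Random-label test: if a fresh probe can fit shuffled labels well, it has
    # enough capacity to memorize, so treat its real-task accuracy with caution.
    rng = np.random.default_rng(0)
    y_shuffled = rng.permutation(y_train)
    random_probe = LogisticRegression(max_iter=1000).fit(X_train, y_shuffled)
    print("fit to shuffled labels:",
          accuracy_score(y_shuffled, random_probe.predict(X_train)))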

Layer Progression Test

Question: Which layers contain the information?

Method:

  1. Train probes at every layer
  2. Plot accuracy vs layer
  3. The resulting curve reveals where the information emerges and flows across layers (see the sketch below)
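
A sketch of the layer sweep, assuming hidden_states is an array of shape [n_layers, n_examples, d_model] and y holds the concept labels (both hypothetical names).

    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    accuracies = []
    for layer in range(hidden_states.shape[0]):
        X_tr, X_te, y_tr, y_te = train_test_split(
            hidden_states[layer], y, test_size=0.2, random_state=0)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        accuracies.append(accuracy_score(y_te, probe.predict(X_te)))

    plt.plot(range(len(accuracies)), accuracies, marker="o")
    plt.xlabel("layer")
    plt.ylabel("probe accuracy")
    plt.show()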

6. Probe Pitfalls: What Can Go Wrong

Pitfall 1: Overfitting

Problem: Probe learns spurious patterns specific to training data.

Symptoms: training accuracy far above test accuracy; high accuracy on the random-label control; results that change drastically across random seeds.

Solutions: more training data, stronger regularization, a simpler probe, and cross-validation.

Pitfall 2: Underfitting

Problem: Probe is too simple to extract available information.

Symptoms: low accuracy even on the training set; accuracy near chance at every layer.

Solutions: a higher-capacity probe (e.g., an MLP), different extraction choices (other layers or token positions), or longer training.

Pitfall 3: The "Information Presence ≠ Causal Use" Fallacy

Problem: High probe accuracy doesn't prove the model uses that information.

Probe success: Information is ACCESSIBLE

Model uses it: Information is USED CAUSALLY

Example:

You can probe for "number of vowels in sentence"
Probe achieves 95% accuracy
→ Information is linearly accessible

But does the model use vowel count for predictions?
→ Probably not! It's likely a spurious byproduct

Solution: Validate with interventions

Pitfall 4: Confounds and Spurious Correlations

Problem: Probe learns correlated features instead of target concept.

Example: you probe for sentiment, but the positive examples are systematically longer than the negative ones; the probe may simply be learning sentence length.

Solutions: build minimal pairs that differ only in the target concept, and check whether the probe can predict suspected confounds (length, frequency, structure) from the same representations.

7. Learned Masks: Identifying Important Components

Instead of hard ablation, learned masks use continuous parameters to identify which components matter.

The Setup

Each component (neuron, head, layer) gets a mask parameter m ∈ [0, 1]:

output = m × component_activation

Where m=1 means "keep component" and m=0 means "ablate component".

Training Objective

Optimize masks to maintain model performance while minimizing mask sum:

Loss = Task_Loss + λ · Σ(mask_weights)
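
A hedged sketch of mask training over attention heads. The helper masked_forward, which multiplies each head's output by its mask during the forward pass and returns the task loss, is hypothetical; its implementation depends on your model and hooking library.

    import torch

    n_heads, lam = 12, 0.05   # lam is the sparsity strength λ (assumed value)
    # Parameterize masks through a sigmoid so m stays in (0, 1); start near 1.
    mask_logits = torch.full((n_heads,), 3.0, requires_grad=True)
    opt = torch.optim.Adam([mask_logits], lr=1e-2)

    for step in range(200):
        m = torch.sigmoid(mask_logits)
        task_loss = masked_forward(model, batch, m)   # hypothetical helper
        loss = task_loss + lam * m.sum()              # Loss = Task_Loss + λ · Σ mᵢ
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Heads whose masks stay high are candidates for "important components".
    important = (torch.sigmoid(mask_logits) > 0.5).nonzero().flatten().tolist()
    print("heads kept by the mask:", important)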

Interpretation

Components whose masks stay near 1 are important for the task; components whose masks are driven toward 0 can be removed with little loss in performance. As with probes, treat this as a hypothesis to confirm with ablation, not as a causal finding on its own.

8. Hard Ablation vs Soft Masking vs Learned Masks

Method | How it Works | Advantages | Disadvantages
Hard Ablation | Set activation to zero | Clear interpretation; causal claim; no training needed | Not differentiable; expensive (test each); binary (on/off)
Soft Masking | Multiply by weight m ∈ [0,1] | Differentiable; continuous effect; can interpolate | Still need to set m; interpretation less clear; not truly causal
Learned Masks | Train m to optimize objective | Automated discovery; differentiable; finds importance jointly | Requires training; task-specific; correlation not causation

When to Use Each

Use hard ablation when you need a clear causal claim about a few specific components; soft masking when you want differentiability or to interpolate between keeping and removing a component; and learned masks when you want to search over many components automatically and plan to validate the result with ablations.

9. Sparse Regularization for Interpretable Masks

Without regularization, learned masks tend to keep most components active (not interpretable). Sparsity regularization encourages simpler solutions.

L1 Regularization

Add sum of absolute mask weights to loss:

Loss = Task_Loss + λ · Σ|mᵢ|

Effect: Drives small weights to exactly zero (sparse solution)

L0 Regularization (Approximate)

Penalize the number of non-zero weights:

Loss = Task_Loss + λ · ||m||₀

Since the L0 norm is non-differentiable, use continuous relaxations such as the hard-concrete (stochastic gate) distribution or a sigmoid with a temperature that is annealed toward a hard 0/1 decision.

Choosing Regularization Strength (λ)

Practical approach:

  1. Train with multiple λ values
  2. Plot: task accuracy vs number of active components
  3. Choose λ at the "elbow" of the curve (see the sweep sketch below)
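
A sketch of the sweep, assuming a hypothetical helper train_masks(lam) that wraps the training loop from Section 7 and returns the final mask values and task accuracy.

    import matplotlib.pyplot as plt

    lambdas = [0.001, 0.003, 0.01, 0.03, 0.1]
    points = []
    for lam in lambdas:
        masks, acc = train_masks(lam)   # hypothetical helper
        points.append(((masks > 0.5).sum().item(), acc))

    n_active, accuracy = zip(*points)
    plt.plot(n_active, accuracy, marker="o")
    plt.xlabel("number of active components")
    plt.ylabel("task accuracy")
    plt.show()
    # Pick λ near the elbow: the point where keeping fewer components starts
    # to cost noticeable task accuracy.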

Benefits of Sparse Masks

A sparse mask singles out a small set of components, which is easier to interpret, cheaper to validate with ablations, and more likely to correspond to a genuine circuit than a dense mask that keeps most of the model.

10. Comparing Probes with Causal Interventions

Probes and interventions often give different answers. Understanding when and why reveals what each method measures.

Case 1: Agreement (Best Case)

Probe: Concept X is linearly accessible at layer 8
Intervention: Patching layer 8 transfers concept X
→ Information is both PRESENT and CAUSALLY USED

Interpretation: Strong evidence that layer 8 processes concept X.

Case 2: Probe Succeeds, Intervention Fails

Probe: 95% accuracy for concept X
Intervention: Patching has no effect on X-related outputs
→ Information is PRESENT but NOT CAUSALLY USED

Interpretation: The model computes X but doesn't use it for this task. It might be a byproduct of other computations, information used for a different prediction, or a signal that is redundant with other features the model relies on instead.

Case 3: Probe Fails, Intervention Succeeds

Probe: Low accuracy (random guessing)
Intervention: Patching strongly affects X-related outputs
→ Information is USED but NOT LINEARLY ACCESSIBLE

Interpretation: The model uses X in a nonlinear or distributed way that linear probes can't extract. Try an MLP probe, probing at other token positions or layers, or feature-level methods such as sparse autoencoders (Week 7).

Case 4: Both Fail

Probe: Low accuracy
Intervention: No effect
→ Information is ABSENT (or you're looking in the wrong place)

Best Practice: Use Both Methods

  1. Start with probes: Fast exploration, identify candidate layers
  2. Validate with interventions: Test causal role of high-probe layers
  3. Investigate disagreements: Learn about representational structure

11. Using Masks for Automated Discovery

Learned masks can automate component selection, complementing manual circuit discovery.

Workflow

  1. Define task: What behavior are you studying?
  2. Initialize masks: One parameter per component (all start at 1.0)
  3. Train masks: Optimize to maintain behavior with sparsity penalty
  4. Threshold: Components with high masks are important
  5. Validate: Use ablation to verify causal importance (steps 4 and 5 are sketched below)
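
A sketch of steps 4 and 5, reusing mask_logits from the Section 7 sketch; run_with_heads_ablated and baseline_accuracy are hypothetical helpers for zero-ablating a list of heads and scoring the task.

    import torch

    # Step 4: threshold the learned masks.
    keep = torch.sigmoid(mask_logits) > 0.5
    dropped_heads = (~keep).nonzero().flatten().tolist()

    # Step 5: hard-ablate everything the masks discarded. If task accuracy
    # holds up, the kept components plausibly carry the behavior; if it drops,
    # the mask missed something or relied on a correlation.
    ablated_acc = run_with_heads_ablated(model, eval_data, dropped_heads)
    print(f"baseline {baseline_accuracy:.3f} vs ablated {ablated_acc:.3f}")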

Hierarchical Masking

Progressively narrow down from coarse to fine-grained:

  1. Layer-level masks: Which layers matter?
  2. Head-level masks: Within important layers, which heads?
  3. Neuron-level masks: Within important heads, which neurons?

This reduces search space and computational cost.

Comparison with ACDC (Week 5)

Aspect | ACDC (Path Patching) | Learned Masks
Method | Ablation-based | Gradient-based
Causal claim | Strong (actual intervention) | Weak (correlation)
Speed | Slow (test each edge) | Fast (parallel gradient)
Granularity | Edges between components | Individual components
Best for | Validating specific circuits | Exploratory discovery

12. Research Workflow: Combining Probes and Interventions

  1. Explore with probes:
    • Train probes at all layers for your concept
    • Identify candidate layers with high accuracy
    • Fast, covers entire model
  2. Discover with masks:
    • Train learned masks on concept-relevant task
    • Identify important components (high mask weights)
    • Hierarchical: layers → heads → neurons
  3. Validate with interventions:
    • Use activation patching on probe-identified layers
    • Ablate mask-identified components
    • Verify causal role
  4. Investigate disagreements:
    • If probe succeeds but intervention fails: information present but unused
    • If intervention succeeds but probe fails: nonlinear or distributed representation
    • Both reveal important properties of how the model works
  5. Iterate:
    • Use findings to refine hypotheses
    • Design better probes based on intervention results
    • Test new components suggested by masks

Goal: Triangulate understanding using multiple methods. No single technique tells the whole story.

Extension: Concept-Based Methods (TCAV and Concept Bottleneck Models)

Beyond Token-Level Probes: Concept Activation Vectors

Probes answer: "Is information X encoded at layer L?" But what if you want to test for higher-level concepts like "politeness," "factual knowledge," or "sentiment"? Testing with Concept Activation Vectors (TCAV) extends probing to human-interpretable concepts.

1. TCAV: Testing with Concept Activation Vectors (Kim et al., 2018)

The Core Idea

Instead of training a probe on labeled data, TCAV uses user-provided examples to define concepts:

Traditional Probe: "Is part-of-speech encoded?" → Need labeled POS data

TCAV: "Does the model use 'politeness'?" → Provide examples of polite/rude text

The TCAV Method

  1. Collect concept examples:
    • Positive examples: sentences exhibiting your concept (e.g., polite requests)
    • Negative examples: random or concept-absent sentences
  2. Extract activations:
    • Run model on all examples
    • Collect activations at target layer(s)
  3. Train linear classifier:
    • Separate positive from negative examples
    • The decision boundary normal vector is the Concept Activation Vector (CAV)
  4. Compute TCAV score:
    • For a given prediction task, measure: TCAV = fraction of examples where gradient aligns with CAV
    • High TCAV → model uses this concept for predictions

Example: Detecting Sentiment Concepts

Concept: "Positive sentiment"
Positive examples: "Great!", "I love this", "Excellent work"
Negative examples: Random sentences

Train CAV → direction in layer 10 that separates these

Test: For sentiment classification task, do gradients point toward CAV?
→ If TCAV = 0.82, then 82% of examples use "positive sentiment" concept
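
A hedged sketch of the computation. It assumes concept_acts and random_acts hold layer activations for the concept and random examples, and that grad_wrt_layer(x) returns the gradient of the prediction logit with respect to that layer's activation for input x; both the arrays and the helper are hypothetical.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Steps 1-3: fit a linear classifier; its normalized weight vector is the CAV.
    X = np.concatenate([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

    # Step 4: TCAV score = fraction of task examples whose gradient has a
    # positive component along the CAV (nudging toward the concept raises the logit).
    sensitivities = np.array([np.dot(grad_wrt_layer(x), cav) for x in task_examples])
    tcav_score = float((sensitivities > 0).mean())
    print("TCAV score:", tcav_score)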

TCAV vs Standard Probes

Aspect | Standard Probe | TCAV
Data needed | Labeled dataset | Concept examples (no labels needed)
Question | "Is X encoded?" | "Is X used causally?"
Measure | Probe accuracy | TCAV score (gradient alignment)
Concept flexibility | Fixed to dataset labels | User-defined concepts
Causal claim | Weaker (encoding ≠ use) | Stronger (tests gradient direction)

Limitations of TCAV

Results depend heavily on how the concept and random counterexamples are chosen; the method assumes the concept corresponds to a single linear direction at the chosen layer; scores should be checked for statistical significance across repeated CAV trainings; and gradient alignment is still weaker evidence than an actual intervention.

2. Concept Bottleneck Models (CBMs)

While TCAV tests concepts post-hoc in trained models, Concept Bottleneck Models (CBMs) build concepts into the architecture during training.

The CBM Architecture

Input → Concept Layer (interpretable) → Prediction Layer

Example (image classification):
Image → [has_beak, has_wings, is_colorful, ...] → Bird species

The middle layer is constrained to represent predefined concepts (e.g., object attributes). This makes the model inherently interpretable.
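
A minimal PyTorch sketch of the bottleneck idea; the input dimension, concept count, and class count are illustrative.

    import torch
    import torch.nn as nn

    class ConceptBottleneck(nn.Module):
        def __init__(self, d_in: int, n_concepts: int, n_classes: int):
            super().__init__()
            self.to_concepts = nn.Linear(d_in, n_concepts)     # interpretable layer
            self.to_labels = nn.Linear(n_concepts, n_classes)  # predicts from concepts only

        def forward(self, x: torch.Tensor):
            concepts = torch.sigmoid(self.to_concepts(x))      # e.g., has_beak, has_wings
            return concepts, self.to_labels(concepts)

    model = ConceptBottleneck(d_in=512, n_concepts=12, n_classes=5)
    concepts, logits = model(torch.randn(4, 512))
    # Training combines a concept loss (against concept annotations) with the task
    # loss; at test time you can intervene by overwriting entries of `concepts`
    # and re-running the label head.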

Advantages of CBMs

Predictions can be explained in terms of human-understandable concepts, and you can intervene at test time by correcting a mispredicted concept and watching the final prediction update.

Challenges with CBMs

They require concept annotations for training, the chosen concept set may be incomplete, and information can "leak" around the bottleneck so that the concept layer does not carry the whole explanation.

CBMs for LLMs

While CBMs originated in vision, they are being adapted for NLP as well.

3. Validating Concept Methods

Concept-based methods inherit validation challenges from probes, plus additional concerns:

Validation Checklist for TCAV/CBMs

  1. Concept definition:
    • Are positive examples truly representative of the concept?
    • Do negative examples appropriately contrast?
    • Test on held-out concept examples
  2. Confound control:
    • Do concept examples differ only in the target concept?
    • Test with minimal pairs (Week 4 counterfactuals)
    • Check for length, frequency, and structural confounds
  3. Causal validation (TCAV):
    • High TCAV score suggests concept is used, but verify with intervention
    • Try steering along CAV direction - does behavior change as predicted?
    • Ablate CAV direction - does it break concept-related predictions?
  4. Completeness (CBMs):
    • Measure concept leakage: does model bypass concept bottleneck?
    • Test with concept interventions: manually set concept values
    • Verify predictions change appropriately

4. Integrating TCAV with Other Methods

TCAV is most powerful when combined with other interpretability techniques:

Combine With | What You Learn | Example
Probes (this week) | Is concept encoded AND used? | Probe shows "politeness" is encoded in layer 8; TCAV shows it's used for formality predictions.
Patching (Week 4) | Which components implement the concept? | TCAV identifies concept direction; patch along that direction to verify causality.
Circuits (Week 8) | What circuit computes the concept? | TCAV finds concept, circuits show how it's computed.
SAEs (Week 7) | Do discovered features align with concepts? | Compare SAE features with CAVs for same concept.

5. Application to Your Project

If your concept is high-level or domain-specific, TCAV may be more appropriate than standard probes:

Example: Musical Key

Example: Scientific Reasoning

Looking Ahead: Validation and Skepticism

Like standard probes, concept methods can mislead if not carefully validated. We'll see in Week 6 (Skepticism) that even compelling concept interpretations can be illusory. Apply the validation framework from Week 4 rigorously.

In-Class Exercise: Training a Pun Detector Probe

Building on our pun dataset and geometric analysis, we will train probes to detect puns from model activations and analyze whether pun understanding is linearly accessible.

Part 1: Probe Training Setup (15 min)

Prepare your pun dataset for probe training:

  1. Load your pun dataset: Use examples from Weeks 3-4 (at least 100 puns, 100 non-puns)
  2. Extract activations: Run all examples through the model, save activations at key layers
    • Focus on layers identified as important in Week 5's causal analysis
    • Extract from the final token position (punchline)
  3. Split data: 70% train, 15% validation, 15% test

Part 2: Train and Evaluate Probes (25 min)

Train linear probes across multiple layers:

  1. Train layer-by-layer:
    • For layers 0, 5, 10, 15, 20, 25, 30 (adjust for your model size)
    • Train a logistic regression classifier: pun (1) vs non-pun (0)
  2. Evaluate performance:
    • Report accuracy on test set for each layer
    • Create layer-wise accuracy plot
  3. Compare with Week 4:
    • Does probe accuracy peak at the same layers where PCA showed separation?
    • Compare probe weight direction to your mass-mean-difference "pun direction"
    • Compute cosine similarity between probe weights and pun direction (see the sketch below)
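
A sketch for the direction comparison, assuming acts holds activations at the best layer (a NumPy array of shape [n_examples, d_model]) and labels is a NumPy array marking puns (1) versus non-puns (0); both names are placeholders.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    probe = LogisticRegression(max_iter=1000).fit(acts, labels)
    probe_dir = probe.coef_[0]

    # Mass-mean "pun direction" from Week 4: mean pun activation minus
    # mean non-pun activation.
    pun_dir = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)

    cos = np.dot(probe_dir, pun_dir) / (
        np.linalg.norm(probe_dir) * np.linalg.norm(pun_dir))
    print(f"cosine similarity (probe weights vs pun direction): {cos:.2f}")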

Part 3: Control Tasks and Validation (20 min)

Validate that your probe is detecting humor, not confounds:

  1. Random label baseline:
    • Shuffle labels and retrain probes
    • If accuracy is still high, probe is overfitting
  2. Selectivity test:
    • Does your pun probe fire on non-pun wordplay?
    • Does it fire on jokes that are not puns?
    • Test on edge cases to understand what it really detects
  3. Linear vs MLP comparison:
    • Train a 1-hidden-layer MLP probe at the best layer
    • Is accuracy much higher? If so, pun representation may be nonlinear
  4. Causal validation:
    • Use your probe weight vector as a steering direction
    • Does steering along this direction affect pun-like outputs? (a steering sketch follows below)
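
A hedged steering sketch using a PyTorch forward hook. It assumes a Hugging Face-style decoder whose blocks live at model.model.layers (this attribute path varies by architecture), a unit-norm direction tensor built from the probe weights, and tokenizer inputs already on the right device; the scaling factor alpha is something to sweep, not a recommended value.

    import torch

    alpha = 4.0  # steering strength (assumption; try several values)

    def steering_hook(module, inputs, output):
        # Many transformer blocks return a tuple whose first element is the
        # hidden states; add the probe direction at every position.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    handle = model.model.layers[best_layer].register_forward_hook(steering_hook)
    try:
        with torch.no_grad():
            out_ids = model.generate(**inputs, max_new_tokens=30)
    finally:
        handle.remove()  # always detach the hook
    print(tokenizer.decode(out_ids[0], skip_special_tokens=True))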

Discussion: What does probe accuracy tell us about how the model represents humor? Is "pun detection" a linearly accessible concept, or does the model use more complex representations?

Open Probe Training Notebook in Colab

Code Exercise

This week's exercise provides hands-on experience with probes and masks:

Open Exercise in Colab

Project Milestone

Due: Thursday of Week 5

Train linear probes to extract your concept from model activations. Analyze probe weights and decision boundaries to understand how the concept is encoded, and test where the concept is linearly accessible.

Probe Training and Analysis

Deliverables:

Strong probe performance with clean linear decision boundaries suggests your concept has a simple geometric structure that can be reliably extracted—ideal for building explanations and interventions.