Tutorial: Auxiliary Models for Interpretability
1. Why Use Auxiliary Models?
In previous weeks, we used direct intervention: patching activations, ablating components,
tracing circuits. These methods are powerful but have limitations:
- Expensive: Testing every component requires many forward passes
- Discrete: Hard ablations aren't differentiable
- Hypothesis-driven: You need to know what to look for
Auxiliary models offer complementary benefits:
- Efficient: Train once, test anywhere
- Differentiable: Can use gradient-based optimization
- Exploratory: Can discover patterns you didn't expect
Two Main Types
Probes: Small models that read information from representations
Hidden State → [Probe] → Predicted Concept
Masks: Learned parameters that identify important components
Component × [Mask Weight] → Masked Component → Output
Key Question: What Do They Tell Us?
Critical distinction: Probes and masks show what could be done with representations,
not necessarily what the model actually does. Always validate with causal interventions.
2. Linear Probes: Reading Out Information
A linear probe is the simplest auxiliary model: a linear classifier trained to extract specific
information from hidden states.
The Setup
Given hidden states h at some layer, train a linear classifier to predict concept y:
ŷ = Wh + b
Where W is a weight matrix and b is a bias vector.
Training Procedure
- Extract representations: Run model on labeled data, save hidden states
- Freeze main model: Don't update the main model's weights
- Train classifier: Optimize W and b to predict labels
- Evaluate: Test on held-out data (a minimal sketch of this pipeline follows)
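To make the procedure concrete, here is a minimal sketch in Python using scikit-learn. The `hidden_states` and `labels` arrays are random placeholders standing in for activations you have already extracted from the frozen model and their concept labels; swap in your own cached data.

```python
# Minimal linear-probe sketch (assumes hidden states are already cached).
# hidden_states: (n_examples, d_model) activations from one layer of the frozen
# model; labels: (n_examples,) concept labels. Both are placeholders here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 768))   # placeholder activations
labels = rng.integers(0, 2, size=1000)         # placeholder concept labels

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0
)

# Logistic regression is a linear probe: y_hat = softmax(W h + b).
# C is the inverse L2 strength; smaller C means stronger regularization.
probe = LogisticRegression(C=1.0, max_iter=1000)
probe.fit(X_train, y_train)

print("train acc:", probe.score(X_train, y_train))
print("test acc:", probe.score(X_test, y_test))
# probe.coef_ is the learned direction W; it can be reused for interventions.
```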
What High Probe Accuracy Means
If a linear probe achieves high accuracy, it tells us:
- ✓ The information is linearly accessible in the representation
- ✓ A simple linear transformation can extract it
- ✗ Does NOT prove the model uses this information causally
- ✗ Does NOT prove the model reads it out the way the probe does (e.g., via this linear direction)
Example: Probing for Sentiment
Text: "This movie was terrible"
Hidden state at layer 8: [h₁, h₂, ..., h₇₆₈]
Linear probe: ŷ = W·h + b
Prediction: Negative sentiment (98% confidence)
Interpretation: Layer 8 contains linearly accessible sentiment information. But does the model
actually use this information? Need to verify with interventions.
3. Nonlinear Probes: MLP Probes
Sometimes information isn't linearly accessible. MLP probes (multi-layer perceptrons) can
extract nonlinear patterns.
Architecture
ŷ = W₂ · ReLU(W₁h + b₁) + b₂
This adds a hidden layer with nonlinear activation, allowing more complex transformations.
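A minimal PyTorch sketch of this architecture follows; `d_model`, `hidden_dim`, and `n_classes` are illustrative values, and the batch of hidden states is a random placeholder for activations extracted from the frozen model.

```python
# Minimal MLP-probe sketch: y_hat = W2 ReLU(W1 h + b1) + b2
import torch
import torch.nn as nn

d_model, hidden_dim, n_classes = 768, 256, 2   # illustrative sizes

mlp_probe = nn.Sequential(
    nn.Linear(d_model, hidden_dim),   # W1 h + b1
    nn.ReLU(),
    nn.Linear(hidden_dim, n_classes)  # W2 ReLU(...) + b2
)

h = torch.randn(32, d_model)                   # placeholder hidden states
y = torch.randint(0, n_classes, (32,))         # placeholder labels

loss = nn.functional.cross_entropy(mlp_probe(h), y)
loss.backward()   # only probe parameters get gradients; the main model stays frozen
```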
The Interpretability Tradeoff
| Aspect | Linear Probe | MLP Probe |
| --- | --- | --- |
| Expressiveness | Limited (linear only) | High (nonlinear patterns) |
| Interpretability | Clear (linear direction) | Opaque (nonlinear transformation) |
| What it measures | Linearly accessible info | Computationally accessible info |
| Overfitting risk | Lower | Higher |
When to Use Each
- Start with linear: Simpler, more interpretable, sufficient for many concepts
- Try MLP if: Linear probe fails but you believe information is present
- Compare both: Gap between linear and MLP performance reveals nonlinearity
Warning: High MLP probe accuracy with low linear probe accuracy suggests information is present
but not in a simple, interpretable form. The model may not use it the way your MLP probe does.
4. Probe Training: Methodology and Best Practices
Dataset Construction
1. Balanced classes:
- Equal (or known) distribution of labels
- Prevents probe from learning spurious correlations
- Example: 50% positive, 50% negative sentiment
2. Minimal confounds:
- Control for other variables that correlate with target
- Example: If probing for tense, balance for sentiment
- Otherwise probe might learn sentiment instead of tense
3. Diverse examples:
- Different sentence structures, lengths, vocabularies
- Test generalization, not memorization
Train/Test Splits
- Standard split: 80% train, 10% validation, 10% test
- Cross-validation: For small datasets, use k-fold CV
- Distribution shift test: Create test set from different domain
Training Hyperparameters
- Learning rate: Start with 1e-3, tune if needed
- Regularization: L2 penalty to prevent overfitting
- Early stopping: Stop when validation accuracy plateaus
- Batch size: 32-128 typically works well (a training-loop sketch follows)
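The sketch below ties these hyperparameters together in a simple PyTorch loop with L2 regularization (via `weight_decay`) and early stopping on validation accuracy. The train and validation datasets are random placeholders for cached (hidden state, label) pairs.

```python
# Probe training loop using the hyperparameters above:
# lr=1e-3, L2 via weight_decay, early stopping, batch size 64.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

train_ds = TensorDataset(torch.randn(800, 768), torch.randint(0, 2, (800,)))
val_ds = TensorDataset(torch.randn(100, 768), torch.randint(0, 2, (100,)))

probe = nn.Linear(768, 2)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3, weight_decay=1e-4)

def accuracy(ds):
    with torch.no_grad():
        x, y = ds.tensors
        return (probe(x).argmax(-1) == y).float().mean().item()

best_acc, patience, bad_epochs = 0.0, 3, 0
for epoch in range(100):
    for x, y in DataLoader(train_ds, batch_size=64, shuffle=True):
        opt.zero_grad()
        nn.functional.cross_entropy(probe(x), y).backward()
        opt.step()
    val_acc = accuracy(val_ds)
    if val_acc > best_acc:
        best_acc, bad_epochs = val_acc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # early stopping: validation plateaued
            break
```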
5. Control Tasks: Validating Probe Behavior
Control tasks verify that probes actually measure what you think they measure, not spurious
correlations.
Selectivity Test
Question: Does the probe specifically extract your target concept, or does it respond to other
properties?
Method:
- Train probe on your concept (e.g., "is sentence past tense?")
- Create control dataset with different concept (e.g., "is sentence negative?")
- Test probe on control dataset
- Probe should have ~random accuracy on control task
If the probe succeeds on the control task: it's not specifically measuring your concept; it's picking up a confound.
Random Label Test
Question: Is the probe overfitting to training data?
Method:
- Randomize labels in training set
- Train probe on random labels
- If probe achieves high accuracy, it's overfitting
- Real patterns should not be learnable from random labels
Layer Progression Test
Question: Which layers contain the information?
Method:
- Train probes at every layer
- Plot accuracy vs layer
- Reveals where information emerges and flows (see the sketch below)
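A minimal sketch of the layer progression test, assuming activations have been cached per layer. The `acts_by_layer` dict and `labels` array are placeholders for your own extracted data.

```python
# Layer-progression sketch: train one linear probe per layer, record accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
acts_by_layer = {l: rng.normal(size=(500, 768)) for l in range(12)}  # placeholder
labels = rng.integers(0, 2, size=500)                                # placeholder

layer_acc = {}
for layer, acts in acts_by_layer.items():
    probe = LogisticRegression(max_iter=1000)
    # 5-fold cross-validated accuracy guards against lucky splits
    layer_acc[layer] = cross_val_score(probe, acts, labels, cv=5).mean()

for layer in sorted(layer_acc):
    print(f"layer {layer:2d}: probe accuracy = {layer_acc[layer]:.2f}")
# Plot accuracy vs layer to see where the concept becomes linearly accessible.
```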
6. Probe Pitfalls: What Can Go Wrong
Pitfall 1: Overfitting
Problem: Probe learns spurious patterns specific to training data.
Symptoms:
- High training accuracy, low test accuracy
- Probe uses many parameters (high capacity)
- Works on training distribution, fails on shifted distribution
Solutions:
- Regularization (L2 penalty, dropout)
- More training data
- Simpler probe (linear instead of MLP)
- Early stopping based on validation set
- Cross-validation
Pitfall 2: Underfitting
Problem: Probe is too simple to extract available information.
Symptoms:
- Low accuracy on both training and test sets
- Linear probe fails but concept seems extractable
- Gap between human intuition and probe performance
Solutions:
- Try nonlinear probe (MLP)
- Increase probe capacity
- Check if information is actually present (use intervention)
- Try different layers
- Verify data quality and labels
Pitfall 3: The "Information Presence ≠ Causal Use" Fallacy
Problem: High probe accuracy doesn't prove the model uses that information.
Probe success: Information is ACCESSIBLE
≠
Model uses it: Information is USED CAUSALLY
Example:
You can probe for "number of vowels in sentence"
Probe achieves 95% accuracy
→ Information is linearly accessible
But does the model use vowel count for predictions?
→ Probably not! It's likely a spurious byproduct
Solution: Validate with interventions (sketched after this list)
- Use activation patching to modify the probed direction
- If output changes as expected, information is causally used
- If output unchanged, information is merely present
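One way to run such an intervention is to remove the probed direction from the hidden state with a forward hook and compare outputs. The sketch below uses standard PyTorch hook mechanics; `model`, `layer_module` (the block whose output gets edited), and the probe weights are assumptions, not a specific library API.

```python
# Remove the component of the hidden state that lies along the probe's weight
# vector; if the behaviour is unchanged, the information was present but unused.
import torch

probe_direction = torch.randn(768)               # placeholder for probe.coef_
d = probe_direction / probe_direction.norm()     # unit vector along the probe

def remove_probe_direction(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # Subtract each position's projection onto d:  h <- h - (h . d) d
    edited = hidden - (hidden @ d).unsqueeze(-1) * d
    return (edited, *output[1:]) if isinstance(output, tuple) else edited

# handle = layer_module.register_forward_hook(remove_probe_direction)  # assumed module
# edited_out = model(**batch)      # compare against the unedited forward pass
# handle.remove()
```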
Pitfall 4: Confounds and Spurious Correlations
Problem: Probe learns correlated features instead of target concept.
Example:
- Goal: Probe for whether sentence mentions "France"
- Confound: Most France examples also mention "Paris"
- Probe might learn to detect "Paris" instead of "France"
Solutions:
- Use control tasks to test selectivity
- Balance confounds in training data
- Test on adversarial examples (France without Paris)
7. Learned Masks: Identifying Important Components
Instead of hard ablation, learned masks use continuous parameters to identify which components
matter.
The Setup
Each component (neuron, head, layer) gets a mask parameter m ∈ [0, 1]:
output = m × component_activation
Where m=1 means "keep component" and m=0 means "ablate component".
Training Objective
Optimize masks to maintain model performance while minimizing mask sum:
Loss = Task_Loss + λ · Σ(mask_weights)
- Task_Loss: Keep model predictions accurate
- Σ(mask_weights): Prefer sparse masks (fewer active components)
- λ: Regularization strength (controls sparsity); see the training sketch below
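A simplified sketch of this objective for per-head masks. Only the mask parameters and the loss structure are concrete; `model`, `dataloader`, and the `masked_forward` helper (which would multiply each head's output by its mask value and return the task loss) are assumptions.

```python
# Learned-mask sketch: one sigmoid-gated parameter per attention head,
# trained with task_loss + lambda * sum(masks).
import torch

n_layers, n_heads, lam = 12, 12, 0.01
mask_logits = torch.zeros(n_layers, n_heads, requires_grad=True)  # sigmoid(0) = 0.5
opt = torch.optim.Adam([mask_logits], lr=1e-2)

# for batch in dataloader:                                # assumed task data
#     masks = torch.sigmoid(mask_logits)                  # values in (0, 1)
#     task_loss = masked_forward(model, masks, batch)     # assumed helper
#     loss = task_loss + lam * masks.sum()                # sparsity penalty
#     opt.zero_grad(); loss.backward(); opt.step()
# Heads whose sigmoid(mask_logits) end up near 1 are candidates for "important";
# heads near 0 are candidates for removal.
```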
Interpretation
- High mask weight (m ≈ 1): Component is important for task
- Low mask weight (m ≈ 0): Component is not necessary
- Intermediate weights: Component has some effect
8. Hard Ablation vs Soft Masking vs Learned Masks
| Method | How it Works | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Hard Ablation | Set activation to zero | • Clear interpretation • Causal claim • No training needed | • Not differentiable • Expensive (test each) • Binary (on/off) |
| Soft Masking | Multiply by weight m ∈ [0,1] | • Differentiable • Continuous effect • Can interpolate | • Still need to set m • Interpretation less clear • Not truly causal |
| Learned Masks | Train m to optimize objective | • Automated discovery • Differentiable • Finds importance jointly | • Requires training • Task-specific • Correlation not causation |
When to Use Each
- Hard ablation: Testing specific hypotheses, validating causal claims
- Soft masking: Analyzing effect magnitude, interpolating between states
- Learned masks: Exploratory discovery, automated pruning, gradient-based search
9. Sparse Regularization for Interpretable Masks
Without regularization, learned masks tend to keep most components active (not interpretable). Sparsity
regularization encourages simpler solutions.
L1 Regularization
Add sum of absolute mask weights to loss:
Loss = Task_Loss + λ · Σ|mᵢ|
Effect: Drives small weights to exactly zero (sparse solution)
L0 Regularization (Approximate)
Penalize the number of non-zero weights:
Loss = Task_Loss + λ · ||m||₀
Since L0 norm is non-differentiable, use continuous relaxations:
- Hard concrete distribution: Samples approximately binary gates during training
- Sigmoid with temperature: A sharp sigmoid approximates a step function (see the sketch below)
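A tiny sketch of the sigmoid-with-temperature relaxation. This is one simple variant; the hard concrete distribution adds noise, stretching, and clipping on top of a similar gate, which is omitted here.

```python
# As the temperature drops, the gate approaches a hard 0/1 step, so
# sum(gates) approximates the L0 norm of the mask vector.
import torch

def soft_gate(logits, temperature):
    return torch.sigmoid(logits / temperature)

logits = torch.tensor([-3.0, -0.5, 0.5, 3.0])   # placeholder mask logits
for t in (1.0, 0.5, 0.1):
    print(t, soft_gate(logits, t).tolist())
# Training typically anneals the temperature downward; the approximate L0
# penalty at temperature t is soft_gate(logits, t).sum().
```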
Choosing Regularization Strength (λ)
- Too low: Dense masks (many components active), not interpretable
- Too high: Over-pruning, task performance degrades
- Sweet spot: Maximum sparsity while maintaining performance
Practical approach:
- Train with multiple λ values
- Plot: task accuracy vs number of active components
- Choose λ at the "elbow" of the curve (a sweep sketch follows)
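A sketch of the λ sweep. `train_masks` and `task_accuracy` are placeholder stubs standing in for the mask-training loop from Section 7 and a masked-model evaluation; in practice you would plot the (active components, accuracy) pairs and pick the elbow.

```python
# Re-train masks at several regularization strengths and record sparsity vs accuracy.
import random

def train_masks(lam):            # placeholder stub: returns 144 mask values
    return [random.random() for _ in range(144)]

def task_accuracy(masks):        # placeholder stub: returns a fake accuracy
    return random.uniform(0.5, 1.0)

for lam in (1e-4, 3e-4, 1e-3, 3e-3, 1e-2):
    masks = train_masks(lam)
    n_active = sum(m > 0.5 for m in masks)
    print(f"lambda={lam:g}: {n_active} active components, "
          f"accuracy={task_accuracy(masks):.2f}")
```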
Benefits of Sparse Masks
- Interpretability: Easier to understand with fewer components
- Efficiency: Can actually remove components (pruning)
- Generalization: Simpler models often generalize better
- Falsifiability: Clear claims about which components matter
10. Comparing Probes with Causal Interventions
Probes and interventions often give different answers. Understanding when and why reveals what each method
measures.
Case 1: Agreement (Best Case)
Probe: Concept X is linearly accessible at layer 8
Intervention: Patching layer 8 transfers concept X
→ Information is both PRESENT and CAUSALLY USED
Interpretation: Strong evidence that layer 8 processes concept X.
Case 2: Probe Succeeds, Intervention Fails
Probe: 95% accuracy for concept X
Intervention: Patching has no effect on X-related outputs
→ Information is PRESENT but NOT CAUSALLY USED
Interpretation: The model computes X but doesn't use it for this task. It might be:
- A byproduct of other computations
- Used for different tasks/contexts
- Spuriously correlated information
Case 3: Probe Fails, Intervention Succeeds
Probe: Low accuracy (random guessing)
Intervention: Patching strongly affects X-related outputs
→ Information is USED but NOT LINEARLY ACCESSIBLE
Interpretation: The model uses X in a nonlinear or distributed way that linear probes can't
extract. Try:
- Nonlinear (MLP) probes
- Different layers
- Multiple layers combined
Case 4: Both Fail
Probe: Low accuracy
Intervention: No effect
→ Information is ABSENT (or you're looking in the wrong place)
Best Practice: Use Both Methods
- Start with probes: Fast exploration, identify candidate layers
- Validate with interventions: Test causal role of high-probe layers
- Investigate disagreements: Learn about representational structure
11. Using Masks for Automated Discovery
Learned masks can automate component selection, complementing manual circuit discovery.
Workflow
- Define task: What behavior are you studying?
- Initialize masks: One parameter per component (all start at 1.0)
- Train masks: Optimize to maintain behavior with sparsity penalty
- Threshold: Components with high masks are important
- Validate: Use ablation to verify causal importance (see the sketch below)
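A sketch of the threshold-and-validate step. The trained mask logits here are random placeholders, and `evaluate_with_heads` is an assumed helper that runs the task with only the listed heads active.

```python
# Keep components whose trained mask value exceeds a cutoff, then hard-ablate
# the rest and check that task performance survives.
import torch

mask_logits = torch.randn(12, 12)              # placeholder trained logits
masks = torch.sigmoid(mask_logits)

keep = (masks > 0.5).nonzero().tolist()        # (layer, head) pairs to retain
print(f"{len(keep)} / {masks.numel()} heads retained")

# acc_masked = evaluate_with_heads(keep)       # assumed ablation-based eval
# acc_full = evaluate_with_heads(all_heads)
# A small gap between acc_masked and acc_full supports the importance claim.
```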
Hierarchical Masking
Progressively narrow down from coarse to fine-grained:
- Layer-level masks: Which layers matter?
- Head-level masks: Within important layers, which heads?
- Neuron-level masks: Within important heads, which neurons?
This reduces search space and computational cost.
Comparison with ACDC (Week 5)
| Aspect | ACDC (Path Patching) | Learned Masks |
| --- | --- | --- |
| Method | Ablation-based | Gradient-based |
| Causal claim | Strong (actual intervention) | Weak (correlation) |
| Speed | Slow (test each edge) | Fast (parallel gradient) |
| Granularity | Edges between components | Individual components |
| Best for | Validating specific circuits | Exploratory discovery |
12. Research Workflow: Combining Probes and Interventions
1. Explore with probes:
   - Train probes at all layers for your concept
   - Identify candidate layers with high accuracy
   - Fast, covers the entire model
2. Discover with masks:
   - Train learned masks on a concept-relevant task
   - Identify important components (high mask weights)
   - Hierarchical: layers → heads → neurons
3. Validate with interventions:
   - Use activation patching on probe-identified layers
   - Ablate mask-identified components
   - Verify causal role
4. Investigate disagreements:
   - If the probe succeeds but the intervention fails: information present but unused
   - If the intervention succeeds but the probe fails: nonlinear or distributed representation
   - Both reveal important properties of how the model works
5. Iterate:
   - Use findings to refine hypotheses
   - Design better probes based on intervention results
   - Test new components suggested by masks
Goal: Triangulate understanding using multiple methods. No single technique tells the whole
story.
Extension: Concept-Based Methods (TCAV and Concept Bottleneck Models)
Beyond Token-Level Probes: Concept Activation Vectors
Probes answer: "Is information X encoded at layer L?" But what if you want to test for higher-level concepts like
"politeness," "factual knowledge," or "sentiment"? Testing with Concept Activation Vectors (TCAV)
extends probing to human-interpretable concepts.
1. TCAV: Testing with Concept Activation Vectors (Kim et al., 2018)
The Core Idea
Instead of training a probe on labeled data, TCAV uses user-provided examples to define concepts:
Traditional Probe: "Is part-of-speech encoded?" → Need labeled POS data
TCAV: "Does the model use 'politeness'?" → Provide examples of polite/rude text
The TCAV Method
1. Collect concept examples:
   - Positive examples: sentences exhibiting your concept (e.g., polite requests)
   - Negative examples: random or concept-absent sentences
2. Extract activations:
   - Run the model on all examples
   - Collect activations at the target layer(s)
3. Train a linear classifier:
   - Separate positive from negative examples
   - The normal vector of the decision boundary is the Concept Activation Vector (CAV)
4. Compute the TCAV score:
   - For a given prediction task, measure:
     TCAV = fraction of task examples whose gradient of the class logit has a positive dot product with the CAV
   - High TCAV → model uses this concept for predictions (a sketch follows below)
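A hedged sketch of the full pipeline under simplifying assumptions: `concept_acts` and `random_acts` stand in for cached activations at the target layer, and `layer_grads` for gradients of the class logit with respect to that layer's activations on the task examples. All arrays below are random placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
concept_acts = rng.normal(size=(200, 768)) + 0.5   # placeholder concept examples
random_acts = rng.normal(size=(200, 768))          # placeholder negatives

# 1) The CAV is the normal of the linear boundary separating concept from random.
clf = LogisticRegression(max_iter=1000)
clf.fit(np.vstack([concept_acts, random_acts]),
        np.array([1] * 200 + [0] * 200))
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# 2) TCAV score: fraction of task examples whose directional derivative of the
#    class logit along the CAV is positive (gradient "points toward" the concept).
layer_grads = rng.normal(size=(100, 768))          # placeholder gradients
tcav_score = float((layer_grads @ cav > 0).mean())
print(f"TCAV score: {tcav_score:.2f}")
```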
Example: Detecting Sentiment Concepts
Concept: "Positive sentiment"
Positive examples: "Great!", "I love this", "Excellent work"
Negative examples: Random sentences
Train CAV → direction in layer 10 that separates these
Test: For sentiment classification task, do gradients point toward CAV?
→ If TCAV = 0.82, then 82% of examples have gradients aligned with the "positive sentiment" CAV
TCAV vs Standard Probes
| Aspect | Standard Probe | TCAV |
| --- | --- | --- |
| Data needed | Labeled dataset | Concept examples (no labels needed) |
| Question | "Is X encoded?" | "Is X used causally?" |
| Measure | Probe accuracy | TCAV score (gradient alignment) |
| Concept flexibility | Fixed to dataset labels | User-defined concepts |
| Causal claim | Weaker (encoding ≠ use) | Stronger (tests gradient direction) |
Limitations of TCAV
- Concept definition: Results depend on how you define concept examples
- Linear assumption: Assumes concept is a linear direction (may miss nonlinear concepts)
- Statistical testing: Requires enough examples for reliable CAV
- Confounds: Concept examples may differ in unintended ways (length, frequency, structure)
2. Concept Bottleneck Models (CBMs)
While TCAV tests concepts post-hoc in trained models, Concept Bottleneck Models (CBMs)
build concepts into the architecture during training.
The CBM Architecture
Input → Concept Layer (interpretable) → Prediction Layer
Example (image classification):
Image → [has_beak, has_wings, is_colorful, ...] → Bird species
The middle layer is constrained to represent predefined concepts (e.g., object attributes). This
makes the model inherently interpretable.
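A minimal PyTorch sketch of the bottleneck structure; the dimensions and concept names are illustrative, and a real CBM would supervise both the concept scores (with concept labels) and the final prediction.

```python
# Input -> concept scores (interpretable layer) -> prediction from concepts only.
import torch
import torch.nn as nn

concepts = ["has_beak", "has_wings", "is_colorful"]   # illustrative concepts

class ConceptBottleneck(nn.Module):
    def __init__(self, d_in, n_concepts, n_classes):
        super().__init__()
        self.to_concepts = nn.Linear(d_in, n_concepts)   # interpretable bottleneck
        self.to_labels = nn.Linear(n_concepts, n_classes)

    def forward(self, x, concept_override=None):
        c = torch.sigmoid(self.to_concepts(x))           # predicted concept scores
        if concept_override is not None:                  # human test-time correction
            c = concept_override
        return c, self.to_labels(c)

model = ConceptBottleneck(d_in=512, n_concepts=len(concepts), n_classes=10)
c, logits = model(torch.randn(4, 512))                   # placeholder inputs
print(dict(zip(concepts, c[0].tolist())))                # inspect concept predictions
```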
Advantages of CBMs
- Interpretability by design: Predictions explicitly flow through concepts
- Human intervention: Users can correct concept predictions at test time
- Debugging: If model fails, inspect which concept was wrong
- Transfer: Concept representations can transfer to new tasks
Challenges with CBMs
- Concept annotation: Requires labeled concept data (expensive)
- Concept completeness: Must define all relevant concepts upfront
- Performance tradeoff: May sacrifice accuracy vs end-to-end models
- Concept leakage: Task-relevant information may slip past the bottleneck (e.g., through soft concept scores or skip connections)
CBMs for LLMs
While CBMs originated in vision, they're being adapted for NLP:
- Text classification: Route through semantic concepts (topic, sentiment, formality)
- Question answering: Explicitly extract relevant facts before answering
- Controllable generation: Condition on concept activations
3. Validating Concept Methods
Concept-based methods inherit validation challenges from probes, plus additional concerns:
Validation Checklist for TCAV/CBMs
- Concept definition:
  - Are positive examples truly representative of the concept?
  - Do negative examples appropriately contrast?
  - Test on held-out concept examples
- Confound control:
  - Do concept examples differ only in the target concept?
  - Test with minimal pairs (Week 4 counterfactuals)
  - Check for length, frequency, and structural confounds
- Causal validation (TCAV):
  - A high TCAV score suggests the concept is used, but verify with an intervention
  - Try steering along the CAV direction: does behavior change as predicted?
  - Ablate the CAV direction: does it break concept-related predictions?
- Completeness (CBMs):
  - Measure concept leakage: does the model bypass the concept bottleneck?
  - Test with concept interventions: manually set concept values
  - Verify predictions change appropriately
4. Integrating TCAV with Other Methods
TCAV is most powerful when combined with other interpretability techniques:
| Combine With | What You Learn | Example |
| --- | --- | --- |
| Probes (this week) | Is concept encoded AND used? | Probe shows "politeness" is encoded in layer 8. TCAV shows it's used for formality predictions. |
| Patching (Week 4) | Which components implement the concept? | TCAV identifies concept direction. Patch along that direction to verify causality. |
| Circuits (Week 8) | What circuit computes the concept? | TCAV finds concept, circuits show how it's computed. |
| SAEs (Week 7) | Do discovered features align with concepts? | Compare SAE features with CAVs for the same concept. |
5. Application to Your Project
If your concept is high-level or domain-specific, TCAV may be more appropriate than standard probes:
Example: Musical Key
- Standard probe: Need labeled dataset of texts annotated with musical keys (hard to get)
- TCAV: Provide examples of texts in C major, D major, etc. → easier
- Train CAV, test if model uses "key" concept for music-related predictions
Example: Scientific Reasoning
- Concept: "Causal reasoning" in scientific texts
- Collect examples with causal language ("because," "leads to," "causes")
- Train CAV, measure TCAV score for scientific QA task
- High score → model relies on causal reasoning; Low score → uses other heuristics
Looking Ahead: Validation and Skepticism
Like standard probes, concept methods can mislead if not carefully validated. We'll see in Week 6
(Skepticism) that even compelling concept interpretations can be illusory. Apply the validation
framework from Week 4 rigorously:
- Run sanity checks (random model test)
- Test on multiple concept example sets
- Validate with causal interventions
- Compare TCAV with probe results
- Report when concept detection fails