Tutorial: Auxiliary Models for Interpretability
1. Why Use Auxiliary Models?
In previous weeks, we used direct intervention: patching activations, ablating components,
tracing circuits. These methods are powerful but have limitations:
- Expensive: Testing every component requires many forward passes
- Discrete: Hard ablations aren't differentiable
- Hypothesis-driven: You need to know what to look for
Auxiliary models offer complementary benefits:
- Efficient: Train once, test anywhere
- Differentiable: Can use gradient-based optimization
- Exploratory: Can discover patterns you didn't expect
Two Main Types
Probes: Small models that read information from representations
Hidden State → [Probe] → Predicted Concept
Masks: Learned parameters that identify important components
Component × [Mask Weight] → Masked Component → Output
Key Question: What Do They Tell Us?
Critical distinction: Probes and masks show what could be done with representations,
not necessarily what the model actually does. Always validate with causal interventions.
2. Linear Probes: Reading Out Information
A linear probe is the simplest auxiliary model: a linear classifier trained to extract specific
information from hidden states.
The Setup
Given hidden states h at some layer, train a linear classifier to predict concept y:
ŷ = Wh + b
Where W is a weight matrix and b is a bias vector.
Training Procedure
- Extract representations: Run model on labeled data, save hidden states
- Freeze main model: Don't update the main model's weights
- Train classifier: Optimize W and b to predict labels
- Evaluate: Test on held-out data (a minimal sketch of this pipeline follows)
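To make the procedure concrete, here is a minimal sketch in Python using scikit-learn. The `hidden_states` and `labels` arrays are random placeholders standing in for activations you have already extracted from the frozen model and their concept labels; swap in your own cached data.

```python
# Minimal linear-probe sketch (assumes hidden states are already cached).
# hidden_states: (n_examples, d_model) activations from one layer of the frozen
# model; labels: (n_examples,) concept labels. Both are placeholders here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 768))   # placeholder activations
labels = rng.integers(0, 2, size=1000)         # placeholder concept labels

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0
)

# Logistic regression is a linear probe: y_hat = softmax(W h + b).
# C is the inverse L2 strength; smaller C means stronger regularization.
probe = LogisticRegression(C=1.0, max_iter=1000)
probe.fit(X_train, y_train)

print("train acc:", probe.score(X_train, y_train))
print("test acc:", probe.score(X_test, y_test))
# probe.coef_ is the learned direction W; it can be reused for interventions.
```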
What High Probe Accuracy Means
If a linear probe achieves high accuracy, it tells us:
- ✓ The information is linearly accessible in the representation
- ✓ A simple linear transformation can extract it
- ✗ Does NOT prove the model uses this information causally
- ✗ Does NOT prove the model reads it out the way the probe does (e.g., via this linear direction)
Example: Probing for Sentiment
Text: "This movie was terrible"
Hidden state at layer 8: [h₁, h₂, ..., h₇₆₈]
Linear probe: ŷ = W·h + b
Prediction: Negative sentiment (98% confidence)
Interpretation: Layer 8 contains linearly accessible sentiment information. But does the model
actually use this information? Need to verify with interventions.
3. Nonlinear Probes: MLP Probes
Sometimes information isn't linearly accessible. MLP probes (multi-layer perceptrons) can
extract nonlinear patterns.
Architecture
ŷ = W₂ · ReLU(W₁h + b₁) + b₂
This adds a hidden layer with nonlinear activation, allowing more complex transformations.
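A minimal PyTorch sketch of this architecture follows; `d_model`, `hidden_dim`, and `n_classes` are illustrative values, and the batch of hidden states is a random placeholder for activations extracted from the frozen model.

```python
# Minimal MLP-probe sketch: y_hat = W2 ReLU(W1 h + b1) + b2
import torch
import torch.nn as nn

d_model, hidden_dim, n_classes = 768, 256, 2   # illustrative sizes

mlp_probe = nn.Sequential(
    nn.Linear(d_model, hidden_dim),   # W1 h + b1
    nn.ReLU(),
    nn.Linear(hidden_dim, n_classes)  # W2 ReLU(...) + b2
)

h = torch.randn(32, d_model)                   # placeholder hidden states
y = torch.randint(0, n_classes, (32,))         # placeholder labels

loss = nn.functional.cross_entropy(mlp_probe(h), y)
loss.backward()   # only probe parameters get gradients; the main model stays frozen
```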
The Interpretability Tradeoff
| Aspect | Linear Probe | MLP Probe |
| --- | --- | --- |
| Expressiveness | Limited (linear only) | High (nonlinear patterns) |
| Interpretability | Clear (linear direction) | Opaque (nonlinear transformation) |
| What it measures | Linearly accessible info | Computationally accessible info |
| Overfitting risk | Lower | Higher |
When to Use Each
- Start with linear: Simpler, more interpretable, sufficient for many concepts
- Try MLP if: Linear probe fails but you believe information is present
- Compare both: Gap between linear and MLP performance reveals nonlinearity
Warning: High MLP probe accuracy with low linear probe accuracy suggests information is present
but not in a simple, interpretable form. The model may not use it the way your MLP probe does.
4. Probe Training: Methodology and Best Practices
Dataset Construction
1. Balanced classes:
- Equal (or known) distribution of labels
- Prevents probe from learning spurious correlations
- Example: 50% positive, 50% negative sentiment
2. Minimal confounds:
- Control for other variables that correlate with target
- Example: If probing for tense, balance for sentiment
- Otherwise probe might learn sentiment instead of tense
3. Diverse examples:
- Different sentence structures, lengths, vocabularies
- Test generalization, not memorization
Train/Test Splits
- Standard split: 80% train, 10% validation, 10% test
- Cross-validation: For small datasets, use k-fold CV
- Distribution shift test: Create test set from different domain
Training Hyperparameters
- Learning rate: Start with 1e-3, tune if needed
- Regularization: L2 penalty to prevent overfitting
- Early stopping: Stop when validation accuracy plateaus
- Batch size: 32-128 typically works well (a training-loop sketch follows)
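The sketch below ties these hyperparameters together in a simple PyTorch loop with L2 regularization (via `weight_decay`) and early stopping on validation accuracy. The train and validation datasets are random placeholders for cached (hidden state, label) pairs.

```python
# Probe training loop using the hyperparameters above:
# lr=1e-3, L2 via weight_decay, early stopping, batch size 64.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

train_ds = TensorDataset(torch.randn(800, 768), torch.randint(0, 2, (800,)))
val_ds = TensorDataset(torch.randn(100, 768), torch.randint(0, 2, (100,)))

probe = nn.Linear(768, 2)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3, weight_decay=1e-4)

def accuracy(ds):
    with torch.no_grad():
        x, y = ds.tensors
        return (probe(x).argmax(-1) == y).float().mean().item()

best_acc, patience, bad_epochs = 0.0, 3, 0
for epoch in range(100):
    for x, y in DataLoader(train_ds, batch_size=64, shuffle=True):
        opt.zero_grad()
        nn.functional.cross_entropy(probe(x), y).backward()
        opt.step()
    val_acc = accuracy(val_ds)
    if val_acc > best_acc:
        best_acc, bad_epochs = val_acc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # early stopping: validation plateaued
            break
```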
5. Control Tasks: Validating Probe Behavior
Control tasks verify that probes actually measure what you think they measure, not spurious
correlations.
Selectivity Test
Question: Does the probe specifically extract your target concept, or does it respond to other
properties?
Method:
- Train probe on your concept (e.g., "is sentence past tense?")
- Create control dataset with different concept (e.g., "is sentence negative?")
- Test probe on control dataset
- Probe should have ~random accuracy on control task
If the probe succeeds on the control task: it's not specifically measuring your concept; it's picking up a confound.
Random Label Test
Question: Is the probe overfitting to training data?
Method:
- Randomize labels in training set
- Train probe on random labels
- If probe achieves high accuracy, it's overfitting
- Real patterns should not be learnable from random labels
Layer Progression Test
Question: Which layers contain the information?
Method:
- Train probes at every layer
- Plot accuracy vs layer
- Reveals where information emerges and flows (see the sketch below)
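A minimal sketch of the layer progression test, assuming activations have been cached per layer. The `acts_by_layer` dict and `labels` array are placeholders for your own extracted data.

```python
# Layer-progression sketch: train one linear probe per layer, record accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
acts_by_layer = {l: rng.normal(size=(500, 768)) for l in range(12)}  # placeholder
labels = rng.integers(0, 2, size=500)                                # placeholder

layer_acc = {}
for layer, acts in acts_by_layer.items():
    probe = LogisticRegression(max_iter=1000)
    # 5-fold cross-validated accuracy guards against lucky splits
    layer_acc[layer] = cross_val_score(probe, acts, labels, cv=5).mean()

for layer in sorted(layer_acc):
    print(f"layer {layer:2d}: probe accuracy = {layer_acc[layer]:.2f}")
# Plot accuracy vs layer to see where the concept becomes linearly accessible.
```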
6. Probe Pitfalls: What Can Go Wrong
Pitfall 1: Overfitting
Problem: Probe learns spurious patterns specific to training data.
Symptoms:
- High training accuracy, low test accuracy
- Probe uses many parameters (high capacity)
- Works on training distribution, fails on shifted distribution
Solutions:
- Regularization (L2 penalty, dropout)
- More training data
- Simpler probe (linear instead of MLP)
- Early stopping based on validation set
- Cross-validation
Pitfall 2: Underfitting
Problem: Probe is too simple to extract available information.
Symptoms:
- Low accuracy on both training and test sets
- Linear probe fails but concept seems extractable
- Gap between human intuition and probe performance
Solutions:
- Try nonlinear probe (MLP)
- Increase probe capacity
- Check if information is actually present (use intervention)
- Try different layers
- Verify data quality and labels
Pitfall 3: The "Information Presence ≠ Causal Use" Fallacy
Problem: High probe accuracy doesn't prove the model uses that information.
Probe success: Information is ACCESSIBLE
≠
Model uses it: Information is USED CAUSALLY
Example:
You can probe for "number of vowels in sentence"
Probe achieves 95% accuracy
→ Information is linearly accessible
But does the model use vowel count for predictions?
→ Probably not! It's likely a spurious byproduct
Solution: Validate with interventions (sketched after this list)
- Use activation patching to modify the probed direction
- If output changes as expected, information is causally used
- If output unchanged, information is merely present
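One way to run such an intervention is to remove the probed direction from the hidden state with a forward hook and compare outputs. The sketch below uses standard PyTorch hook mechanics; `model`, `layer_module` (the block whose output gets edited), and the probe weights are assumptions, not a specific library API.

```python
# Remove the component of the hidden state that lies along the probe's weight
# vector; if the behaviour is unchanged, the information was present but unused.
import torch

probe_direction = torch.randn(768)               # placeholder for probe.coef_
d = probe_direction / probe_direction.norm()     # unit vector along the probe

def remove_probe_direction(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # Subtract each position's projection onto d:  h <- h - (h . d) d
    edited = hidden - (hidden @ d).unsqueeze(-1) * d
    return (edited, *output[1:]) if isinstance(output, tuple) else edited

# handle = layer_module.register_forward_hook(remove_probe_direction)  # assumed module
# edited_out = model(**batch)      # compare against the unedited forward pass
# handle.remove()
```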
Pitfall 4: Confounds and Spurious Correlations
Problem: Probe learns correlated features instead of target concept.
Example:
- Goal: Probe for whether sentence mentions "France"
- Confound: Most France examples also mention "Paris"
- Probe might learn to detect "Paris" instead of "France"
Solutions:
- Use control tasks to test selectivity
- Balance confounds in training data
- Test on adversarial examples (France without Paris)
7. Learned Masks: Identifying Important Components
Instead of hard ablation, learned masks use continuous parameters to identify which components
matter.
The Setup
Each component (neuron, head, layer) gets a mask parameter m ∈ [0, 1]:
output = m × component_activation
Where m=1 means "keep component" and m=0 means "ablate component".
Training Objective
Optimize masks to maintain model performance while minimizing mask sum:
Loss = Task_Loss + λ · Σ(mask_weights)
- Task_Loss: Keep model predictions accurate
- Σ(mask_weights): Prefer sparse masks (fewer active components)
- λ: Regularization strength (controls sparsity); see the training sketch below
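A simplified sketch of this objective for per-head masks. Only the mask parameters and the loss structure are concrete; `model`, `dataloader`, and the `masked_forward` helper (which would multiply each head's output by its mask value and return the task loss) are assumptions.

```python
# Learned-mask sketch: one sigmoid-gated parameter per attention head,
# trained with task_loss + lambda * sum(masks).
import torch

n_layers, n_heads, lam = 12, 12, 0.01
mask_logits = torch.zeros(n_layers, n_heads, requires_grad=True)  # sigmoid(0) = 0.5
opt = torch.optim.Adam([mask_logits], lr=1e-2)

# for batch in dataloader:                                # assumed task data
#     masks = torch.sigmoid(mask_logits)                  # values in (0, 1)
#     task_loss = masked_forward(model, masks, batch)     # assumed helper
#     loss = task_loss + lam * masks.sum()                # sparsity penalty
#     opt.zero_grad(); loss.backward(); opt.step()
# Heads whose sigmoid(mask_logits) end up near 1 are candidates for "important";
# heads near 0 are candidates for removal.
```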
Interpretation
- High mask weight (m ≈ 1): Component is important for task
- Low mask weight (m ≈ 0): Component is not necessary
- Intermediate weights: Component has some effect
8. Hard Ablation vs Soft Masking vs Learned Masks
| Method | How it Works | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Hard Ablation | Set activation to zero | • Clear interpretation • Causal claim • No training needed | • Not differentiable • Expensive (test each) • Binary (on/off) |
| Soft Masking | Multiply by weight m ∈ [0,1] | • Differentiable • Continuous effect • Can interpolate | • Still need to set m • Interpretation less clear • Not truly causal |
| Learned Masks | Train m to optimize objective | • Automated discovery • Differentiable • Finds importance jointly | • Requires training • Task-specific • Correlation not causation |
When to Use Each
- Hard ablation: Testing specific hypotheses, validating causal claims
- Soft masking: Analyzing effect magnitude, interpolating between states
- Learned masks: Exploratory discovery, automated pruning, gradient-based search
9. Sparse Regularization for Interpretable Masks
Without regularization, learned masks tend to keep most components active (not interpretable). Sparsity
regularization encourages simpler solutions.
L1 Regularization
Add sum of absolute mask weights to loss:
Loss = Task_Loss + λ · Σ|mᵢ|
Effect: Drives small weights to exactly zero (sparse solution)
L0 Regularization (Approximate)
Penalize the number of non-zero weights:
Loss = Task_Loss + λ · ||m||₀
Since L0 norm is non-differentiable, use continuous relaxations:
- Hard concrete distribution: Samples approximately binary gates during training
- Sigmoid with temperature: A sharp sigmoid approximates a step function (see the sketch below)
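A tiny sketch of the sigmoid-with-temperature relaxation. This is one simple variant; the hard concrete distribution adds noise, stretching, and clipping on top of a similar gate, which is omitted here.

```python
# As the temperature drops, the gate approaches a hard 0/1 step, so
# sum(gates) approximates the L0 norm of the mask vector.
import torch

def soft_gate(logits, temperature):
    return torch.sigmoid(logits / temperature)

logits = torch.tensor([-3.0, -0.5, 0.5, 3.0])   # placeholder mask logits
for t in (1.0, 0.5, 0.1):
    print(t, soft_gate(logits, t).tolist())
# Training typically anneals the temperature downward; the approximate L0
# penalty at temperature t is soft_gate(logits, t).sum().
```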
Choosing Regularization Strength (λ)
- Too low: Dense masks (many components active), not interpretable
- Too high: Over-pruning, task performance degrades
- Sweet spot: Maximum sparsity while maintaining performance
Practical approach:
- Train with multiple λ values
- Plot: task accuracy vs number of active components
- Choose λ at the "elbow" of the curve (a sweep sketch follows)
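A sketch of the λ sweep. `train_masks` and `task_accuracy` are placeholder stubs standing in for the mask-training loop from Section 7 and a masked-model evaluation; in practice you would plot the (active components, accuracy) pairs and pick the elbow.

```python
# Re-train masks at several regularization strengths and record sparsity vs accuracy.
import random

def train_masks(lam):            # placeholder stub: returns 144 mask values
    return [random.random() for _ in range(144)]

def task_accuracy(masks):        # placeholder stub: returns a fake accuracy
    return random.uniform(0.5, 1.0)

for lam in (1e-4, 3e-4, 1e-3, 3e-3, 1e-2):
    masks = train_masks(lam)
    n_active = sum(m > 0.5 for m in masks)
    print(f"lambda={lam:g}: {n_active} active components, "
          f"accuracy={task_accuracy(masks):.2f}")
```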
Benefits of Sparse Masks
- Interpretability: Easier to understand with fewer components
- Efficiency: Can actually remove components (pruning)
- Generalization: Simpler models often generalize better
- Falsifiability: Clear claims about which components matter
10. Comparing Probes with Causal Interventions
Probes and interventions often give different answers. Understanding when and why reveals what each method
measures.
Case 1: Agreement (Best Case)
Probe: Concept X is linearly accessible at layer 8
Intervention: Patching layer 8 transfers concept X
→ Information is both PRESENT and CAUSALLY USED
Interpretation: Strong evidence that layer 8 processes concept X.
Case 2: Probe Succeeds, Intervention Fails
Probe: 95% accuracy for concept X
Intervention: Patching has no effect on X-related outputs
→ Information is PRESENT but NOT CAUSALLY USED
Interpretation: The model computes X but doesn't use it for this task. It might be:
- A byproduct of other computations
- Used for different tasks/contexts
- Spuriously correlated information
Case 3: Probe Fails, Intervention Succeeds
Probe: Low accuracy (random guessing)
Intervention: Patching strongly affects X-related outputs
→ Information is USED but NOT LINEARLY ACCESSIBLE
Interpretation: The model uses X in a nonlinear or distributed way that linear probes can't
extract. Try:
- Nonlinear (MLP) probes
- Different layers
- Multiple layers combined
Case 4: Both Fail
Probe: Low accuracy
Intervention: No effect
→ Information is ABSENT (or you're looking in the wrong place)
Best Practice: Use Both Methods
- Start with probes: Fast exploration, identify candidate layers
- Validate with interventions: Test causal role of high-probe layers
- Investigate disagreements: Learn about representational structure
11. Using Masks for Automated Discovery
Learned masks can automate component selection, complementing manual circuit discovery.
Workflow
- Define task: What behavior are you studying?
- Initialize masks: One parameter per component (all start at 1.0)
- Train masks: Optimize to maintain behavior with sparsity penalty
- Threshold: Components with high masks are important
- Validate: Use ablation to verify causal importance (see the sketch below)
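A sketch of the threshold-and-validate step. The trained mask logits here are random placeholders, and `evaluate_with_heads` is an assumed helper that runs the task with only the listed heads active.

```python
# Keep components whose trained mask value exceeds a cutoff, then hard-ablate
# the rest and check that task performance survives.
import torch

mask_logits = torch.randn(12, 12)              # placeholder trained logits
masks = torch.sigmoid(mask_logits)

keep = (masks > 0.5).nonzero().tolist()        # (layer, head) pairs to retain
print(f"{len(keep)} / {masks.numel()} heads retained")

# acc_masked = evaluate_with_heads(keep)       # assumed ablation-based eval
# acc_full = evaluate_with_heads(all_heads)
# A small gap between acc_masked and acc_full supports the importance claim.
```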
Hierarchical Masking
Progressively narrow down from coarse to fine-grained:
- Layer-level masks: Which layers matter?
- Head-level masks: Within important layers, which heads?
- Neuron-level masks: Within important heads, which neurons?
This reduces search space and computational cost.
Comparison with ACDC (Week 5)
| Aspect | ACDC (Path Patching) | Learned Masks |
| --- | --- | --- |
| Method | Ablation-based | Gradient-based |
| Causal claim | Strong (actual intervention) | Weak (correlation) |
| Speed | Slow (test each edge) | Fast (parallel gradient) |
| Granularity | Edges between components | Individual components |
| Best for | Validating specific circuits | Exploratory discovery |
12. Research Workflow: Combining Probes and Interventions
1. Explore with probes:
   - Train probes at all layers for your concept
   - Identify candidate layers with high accuracy
   - Fast, covers the entire model
2. Discover with masks:
   - Train learned masks on a concept-relevant task
   - Identify important components (high mask weights)
   - Hierarchical: layers → heads → neurons
3. Validate with interventions:
   - Use activation patching on probe-identified layers
   - Ablate mask-identified components
   - Verify causal role
4. Investigate disagreements:
   - If the probe succeeds but the intervention fails: information present but unused
   - If the intervention succeeds but the probe fails: nonlinear or distributed representation
   - Both reveal important properties of how the model works
5. Iterate:
   - Use findings to refine hypotheses
   - Design better probes based on intervention results
   - Test new components suggested by masks
Goal: Triangulate understanding using multiple methods. No single technique tells the whole
story.
Extension: Concept-Based Methods (TCAV and Concept Bottleneck Models)
Beyond Token-Level Probes: Concept Activation Vectors
Probes answer: "Is information X encoded at layer L?" But what if you want to test for higher-level concepts like
"politeness," "factual knowledge," or "sentiment"? Testing with Concept Activation Vectors (TCAV)
extends probing to human-interpretable concepts.
1. TCAV: Testing with Concept Activation Vectors (Kim et al., 2018)
The Core Idea
Instead of training a probe on labeled data, TCAV uses user-provided examples to define concepts:
Traditional Probe: "Is part-of-speech encoded?" → Need labeled POS data
TCAV: "Does the model use 'politeness'?" → Provide examples of polite/rude text
The TCAV Method
1. Collect concept examples:
   - Positive examples: sentences exhibiting your concept (e.g., polite requests)
   - Negative examples: random or concept-absent sentences
2. Extract activations:
   - Run the model on all examples
   - Collect activations at the target layer(s)
3. Train a linear classifier:
   - Separate positive from negative examples
   - The normal vector of the decision boundary is the Concept Activation Vector (CAV)
4. Compute the TCAV score:
   - For a given prediction task, measure:
     TCAV = fraction of task examples whose gradient of the class logit has a positive dot product with the CAV
   - High TCAV → model uses this concept for predictions (a sketch follows below)
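A hedged sketch of the full pipeline under simplifying assumptions: `concept_acts` and `random_acts` stand in for cached activations at the target layer, and `layer_grads` for gradients of the class logit with respect to that layer's activations on the task examples. All arrays below are random placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
concept_acts = rng.normal(size=(200, 768)) + 0.5   # placeholder concept examples
random_acts = rng.normal(size=(200, 768))          # placeholder negatives

# 1) The CAV is the normal of the linear boundary separating concept from random.
clf = LogisticRegression(max_iter=1000)
clf.fit(np.vstack([concept_acts, random_acts]),
        np.array([1] * 200 + [0] * 200))
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# 2) TCAV score: fraction of task examples whose directional derivative of the
#    class logit along the CAV is positive (gradient "points toward" the concept).
layer_grads = rng.normal(size=(100, 768))          # placeholder gradients
tcav_score = float((layer_grads @ cav > 0).mean())
print(f"TCAV score: {tcav_score:.2f}")
```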
Example: Detecting Sentiment Concepts
Concept: "Positive sentiment"
Positive examples: "Great!", "I love this", "Excellent work"
Negative examples: Random sentences
Train CAV → direction in layer 10 that separates these
Test: For sentiment classification task, do gradients point toward CAV?
→ If TCAV = 0.82, then 82% of examples have gradients aligned with the "positive sentiment" CAV
TCAV vs Standard Probes
| Aspect | Standard Probe | TCAV |
| --- | --- | --- |
| Data needed | Labeled dataset | Concept examples (no labels needed) |
| Question | "Is X encoded?" | "Is X used causally?" |
| Measure | Probe accuracy | TCAV score (gradient alignment) |
| Concept flexibility | Fixed to dataset labels | User-defined concepts |
| Causal claim | Weaker (encoding ≠ use) | Stronger (tests gradient direction) |
Limitations of TCAV
- Concept definition: Results depend on how you define concept examples
- Linear assumption: Assumes concept is a linear direction (may miss nonlinear concepts)
- Statistical testing: Requires enough examples for reliable CAV
- Confounds: Concept examples may differ in unintended ways (length, frequency, structure)
2. Concept Bottleneck Models (CBMs)
While TCAV tests concepts post-hoc in trained models, Concept Bottleneck Models (CBMs)
build concepts into the architecture during training.
The CBM Architecture
Input → Concept Layer (interpretable) → Prediction Layer
Example (image classification):
Image → [has_beak, has_wings, is_colorful, ...] → Bird species
The middle layer is constrained to represent predefined concepts (e.g., object attributes). This
makes the model inherently interpretable.
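A minimal PyTorch sketch of the bottleneck structure; the dimensions and concept names are illustrative, and a real CBM would supervise both the concept scores (with concept labels) and the final prediction.

```python
# Input -> concept scores (interpretable layer) -> prediction from concepts only.
import torch
import torch.nn as nn

concepts = ["has_beak", "has_wings", "is_colorful"]   # illustrative concepts

class ConceptBottleneck(nn.Module):
    def __init__(self, d_in, n_concepts, n_classes):
        super().__init__()
        self.to_concepts = nn.Linear(d_in, n_concepts)   # interpretable bottleneck
        self.to_labels = nn.Linear(n_concepts, n_classes)

    def forward(self, x, concept_override=None):
        c = torch.sigmoid(self.to_concepts(x))           # predicted concept scores
        if concept_override is not None:                  # human test-time correction
            c = concept_override
        return c, self.to_labels(c)

model = ConceptBottleneck(d_in=512, n_concepts=len(concepts), n_classes=10)
c, logits = model(torch.randn(4, 512))                   # placeholder inputs
print(dict(zip(concepts, c[0].tolist())))                # inspect concept predictions
```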
Advantages of CBMs
- Interpretability by design: Predictions explicitly flow through concepts
- Human intervention: Users can correct concept predictions at test time
- Debugging: If model fails, inspect which concept was wrong
- Transfer: Concept representations can transfer to new tasks
Challenges with CBMs
- Concept annotation: Requires labeled concept data (expensive)
- Concept completeness: Must define all relevant concepts upfront
- Performance tradeoff: May sacrifice accuracy vs end-to-end models
- Concept leakage: Task-relevant information may slip past the bottleneck (e.g., through soft concept scores or skip connections)
CBMs for LLMs
While CBMs originated in vision, they're being adapted for NLP:
- Text classification: Route through semantic concepts (topic, sentiment, formality)
- Question answering: Explicitly extract relevant facts before answering
- Controllable generation: Condition on concept activations
3. Validating Concept Methods
Concept-based methods inherit validation challenges from probes, plus additional concerns:
Validation Checklist for TCAV/CBMs
- Concept definition:
  - Are positive examples truly representative of the concept?
  - Do negative examples appropriately contrast?
  - Test on held-out concept examples
- Confound control:
  - Do concept examples differ only in the target concept?
  - Test with minimal pairs (Week 4 counterfactuals)
  - Check for length, frequency, and structural confounds
- Causal validation (TCAV):
  - A high TCAV score suggests the concept is used, but verify with an intervention
  - Try steering along the CAV direction: does behavior change as predicted?
  - Ablate the CAV direction: does it break concept-related predictions?
- Completeness (CBMs):
  - Measure concept leakage: does the model bypass the concept bottleneck?
  - Test with concept interventions: manually set concept values
  - Verify predictions change appropriately
4. Integrating TCAV with Other Methods
TCAV is most powerful when combined with other interpretability techniques:
| Combine With | What You Learn | Example |
| --- | --- | --- |
| Probes (this week) | Is concept encoded AND used? | Probe shows "politeness" is encoded in layer 8. TCAV shows it's used for formality predictions. |
| Patching (Week 4) | Which components implement the concept? | TCAV identifies concept direction. Patch along that direction to verify causality. |
| Circuits (Week 8) | What circuit computes the concept? | TCAV finds concept, circuits show how it's computed. |
| SAEs (Week 7) | Do discovered features align with concepts? | Compare SAE features with CAVs for the same concept. |
5. Application to Your Project
If your concept is high-level or domain-specific, TCAV may be more appropriate than standard probes:
Example: Musical Key
- Standard probe: Need labeled dataset of texts annotated with musical keys (hard to get)
- TCAV: Provide examples of texts in C major, D major, etc. → easier
- Train CAV, test if model uses "key" concept for music-related predictions
Example: Scientific Reasoning
- Concept: "Causal reasoning" in scientific texts
- Collect examples with causal language ("because," "leads to," "causes")
- Train CAV, measure TCAV score for scientific QA task
- High score → model relies on causal reasoning; Low score → uses other heuristics
Looking Ahead: Validation and Skepticism
Like standard probes, concept methods can mislead if not carefully validated. We'll see in Week 6
(Skepticism) that even compelling concept interpretations can be illusory. Apply the validation
framework from Week 4 rigorously:
- Run sanity checks (random model test)
- Test on multiple concept example sets
- Validate with causal interventions
- Compare TCAV with probe results
- Report when concept detection fails