So far, we've learned to observe representations (visualization), manipulate them (steering), and test causal importance (patching). Now we put it all together: reverse-engineering complete computational circuits. A circuit is a minimal, faithful subgraph of the network that implements a specific behavior. This week focuses on circuit discovery methodology, with the Indirect Object Identification (IOI) circuit in GPT-2 small as our main case study: how the model identifies who received an action (e.g., "Mary gave the milk to ____" → "John"). We also cover binding (attribute-entity associations), concept induction (semantic-level patterns), automated discovery (ACDC), and testing circuit faithfulness.
By the end of this week, you should be able to:
A circuit is a computational subgraph of a neural network that implements a specific algorithm or behavior. Think of it like reverse-engineering: given a complex machine (the full network), identify the minimal set of components that perform a particular function.
Path patching extends activation patching to trace specific paths through the network. Instead of patching an entire layer, we patch only the pathway from component A to component B.
Consider two runs: clean and corrupted. We want to know: does information flow from component A (e.g., attention head 3.2) to component B (e.g., attention head 7.5)?
In transformers, component A writes to the residual stream and component B reads from it. To patch the path A→B: cache A's output from the corrupted run, then rerun the clean input with A's contribution to B's input replaced by the cached corrupted version, while every other path into B stays clean.
A high path-patching effect means that A causally influences B along this specific pathway.
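The procedure above can be sketched end-to-end on a toy two-component model. Here `comp_a`, `comp_b`, and the feature vectors are illustrative stand-ins, not a real transformer's API; in this toy, overriding A's output on the path into B is exactly a path patch:

```python
import numpy as np

# Toy "network": component A writes to a residual stream, component B
# reads from it. All names here are illustrative, not a real model's API.

def comp_a(x):
    return np.array([x[0], 0.0])      # A encodes feature 0

def comp_b(resid):
    return resid[0] + 0.1 * resid[1]  # B mostly reads what A wrote

def run(x, a_out=None):
    """Forward pass; optionally override A's output on the path into B."""
    resid = x.copy()
    a = comp_a(x) if a_out is None else a_out
    resid = resid + a
    return comp_b(resid)

clean, corrupt = np.array([1.0, 0.0]), np.array([-1.0, 0.0])

# Path patch A -> B: run on the clean input, but give B the A-output
# computed from the corrupted input. Everything else stays clean.
patched = run(clean, a_out=comp_a(corrupt))
effect = run(clean) - patched   # large effect => A causally feeds B
```

A real path patch additionally has to keep A's output unchanged on every path *except* the one into B; tools like TransformerLens implement this with activation hooks.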
Multi-head attention circuits communicate through composition: one head's output influences another head's computation. There are three main types:
Head A's output affects what Head B pays attention to by modifying B's query vectors.
Example: Binding circuits use Q-composition. An earlier head identifies an entity, and a later head uses this information to determine where to look for attributes.
Head A's output affects what Head B attends to by modifying B's key vectors.
Example: In the IOI circuit, earlier heads modify key representations so that name mover heads can attend to the correct name (the indirect object) and not the subject.
Head A's output affects what information Head B extracts by modifying B's value vectors.
How to distinguish: Path patch from A to B's Q/K/V inputs separately. Whichever has the highest effect tells you the composition type.
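This diagnostic can be sketched on a toy head whose true wiring routes A's output through the query; all matrices, vectors, and names below are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy head B reading query/key/value inputs over two positions.
def head_b(q_in, k_in, v_in):
    attn = softmax(k_in @ q_in)     # attention scores from Q and K
    return attn @ v_in              # weighted readout of the values

# In the "true" wiring, head A's output enters B only via the query.
def inputs_to_b(a_out):
    q = np.array([1.0, 0.0]) + a_out        # A feeds Q
    k = np.array([[1.0, 0.0], [0.0, 1.0]])  # keys (A-independent)
    v = np.array([10.0, -10.0])             # values (A-independent)
    return q, k, v

a_clean, a_corrupt = np.array([3.0, 0.0]), np.array([0.0, 3.0])
qc, kc, vc = inputs_to_b(a_clean)      # clean inputs to B
qx, kx, vx = inputs_to_b(a_corrupt)    # corrupted inputs to B
base = head_b(qc, kc, vc)

# Patch A's (corrupted) influence into each of B's inputs separately.
effects = {
    "Q": abs(base - head_b(qx, kc, vc)),
    "K": abs(base - head_b(qc, kx, vc)),
    "V": abs(base - head_b(qc, kc, vx)),
}
comp_type = max(effects, key=effects.get)   # largest effect wins
```

Because only the query depends on A in this toy, the K- and V-patches have zero effect and the procedure correctly reports Q-composition.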
Indirect Object Identification (IOI) is a natural language task: given a sentence like "When Mary and John went to the store, Mary gave the milk to ____", the model must predict the indirect object—the person who received the action—i.e., "John", not "Mary" (the subject). Wang et al. (2022) reverse-engineered how GPT-2 small implements this behavior using causal interventions.
Input: "When Mary and John went to the store, Mary gave the milk to"
Model predicts: "John" (the indirect object—who received the milk—not "Mary", the subject)
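IOI performance is typically scored by the logit difference between the indirect object and the subject; a positive score means the model prefers the correct completion. A minimal sketch, with illustrative logit values in place of a real forward pass:

```python
# Toy logits standing in for a real model's output distribution.
logits = {"John": 14.2, "Mary": 12.9, "the": 11.0}  # illustrative values

def ioi_logit_diff(logits, io="John", s="Mary"):
    """Standard IOI metric: logit(indirect object) - logit(subject)."""
    return logits[io] - logits[s]

score = ioi_logit_diff(logits)   # > 0: model predicts the indirect object
```

This single scalar is the metric that ablation and patching experiments on the IOI circuit track.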
The IOI circuit in GPT-2 small comprises 26 attention heads grouped into 7 main classes. Key roles include:
Name Mover Heads
Duplicate Token Heads
S-Inhibition (Subject Inhibition)
Other Supporting Heads
The circuit was discovered using causal interventions (ablation, path patching). The paper evaluates the explanation with three criteria:
Necessity (ablation):
Sufficiency and minimality:
Binding is about associating attributes with entities: remembering that "tall" goes with person 1 and "short" goes with person 2.
"The tall person and the short person walked into a room.
The tall person sat down."
Question: Which person sat? → person 1 (the tall one)
Nikhil's work identified a multi-head circuit for maintaining attribute-entity bindings:
Phase 1: Create Bindings (during first mention)
Phase 2: Retrieve Bindings (during second mention)
The composition is through queries (Q-composition):
| Aspect | IOI Circuit | Binding Circuit |
|---|---|---|
| Composition Type | Mix of Q/K/V (name movers, duplicate token, S-inhibition) | Q-composition |
| What's Modified | Which name is attended to (subject vs. indirect object) | Where the retrieving head looks (its queries) |
| Complexity | 26 heads in 7 classes (GPT-2 small) | Multiple heads (3-5) |
| Task | Indirect object identification | Attribute-entity association |
| Information Used | Duplicate names, subject vs. object roles | Binding vectors |
The Feucht paper discovered that induction operates at two levels: token-level and concept-level.
The IOI circuit we studied relies on token-level structure: duplicate detection and name moving. Similar pattern-copying (e.g., induction) also works at the token level.
"When Mary and John went to the store, Mary gave a drink to"
→ Predicts: "John" (via the IOI mechanism, or via token-level copying of the literal string "John" seen earlier in the context)
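The token-level route can be written out as a literal algorithm ("A B … A → B": find the previous occurrence of the current token and copy its successor). This plain-Python sketch operates on strings rather than attention heads:

```python
def induct(tokens):
    """Token-level induction: copy the token that followed the most
    recent previous occurrence of the current token."""
    cur = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards
        if tokens[i] == cur:
            return tokens[i + 1]              # copy the successor
    return None                               # no earlier occurrence

seq = ["Mr", "Dursley", "said", "Mr"]
pred = induct(seq)   # -> "Dursley"
```

The concept-level route described next cannot be written this way: it matches on meaning, so it survives even when the literal token is changed.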
A parallel mechanism that copies semantic patterns, not just tokens.
"When Mary and John went to the store, Mary gave a drink to"
BUT if "John" is edited in the model's weights to "Jonathan":
→ Still predicts: "Jonathan" (semantic association, not token)
The model uses both routes simultaneously:
Route 1: Token Circuit (attention-based)
Route 2: Concept Circuit (MLP-based)
This reveals an important lesson: behaviors can have multiple implementations. Circuit discovery must account for:
Manually finding circuits is tedious. Automated Circuit Discovery (ACDC) systematically searches for circuits using path patching.
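The core ACDC loop can be sketched as greedy edge pruning against a performance threshold. The "model" below is just a weighted sum over illustrative edges, standing in for activation patching on a real computational graph:

```python
# Illustrative edge "importances"; in real ACDC, removing an edge means
# patching in its corrupted activation and re-measuring the task metric.
edges = {"a->b": 2.0, "b->out": 1.5, "c->out": 0.01, "d->out": 0.02}

def metric(active_edges):
    # Stand-in for task performance (e.g., IOI logit diff) with only
    # the given edges un-patched.
    return sum(edges[e] for e in active_edges)

tau = 0.1                       # pruning threshold
circuit = set(edges)
for e in sorted(edges):         # try removing each edge in turn
    trial = circuit - {e}
    if metric(circuit) - metric(trial) < tau:
        circuit = trial         # removal barely hurts: prune the edge
```

The threshold `tau` controls the sparsity/faithfulness trade-off: a larger `tau` yields a smaller circuit that explains less of the behavior.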
A circuit is faithful if it actually explains how the full model works, not just a spurious correlation.
1. Ablation Test:
2. Out-of-Distribution Test:
3. Adversarial Test:
4. Intervention Test:
5. Necessity and Sufficiency:
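One common way to quantify the ablation and necessity/sufficiency tests above is to compare task performance across three runs: the full model, the circuit alone (everything else ablated), and the complement (the circuit ablated). The numbers below are illustrative placeholders for real logit-diff measurements:

```python
# Illustrative measurements; in practice these come from real runs.
full_perf = 3.1        # full model logit diff
circuit_perf = 2.9     # circuit-only run, non-circuit heads mean-ablated
ablated_perf = 0.2     # complement run, circuit heads ablated

# Sufficiency: the circuit alone should recover most of the behavior.
faithfulness = circuit_perf / full_perf
# Necessity: removing the circuit should destroy most of the behavior.
necessity_drop = (full_perf - ablated_perf) / full_perf
```

A faithful, necessary circuit has both ratios near 1; a spuriously correlated component set does not.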
Just because a component activates during a behavior doesn't mean it causes the behavior. Always use causal interventions (patching/ablation) to test.
A single head might implement multiple circuits for different tasks. Test that your circuit is specific to your target behavior.
Circuits found on one dataset may not generalize. Always validate on diverse, out-of-distribution examples.
Setting the pruning threshold too high removes genuinely important components. Validate that your "minimal" circuit still works.
Models often have redundant components (e.g., multiple name mover heads in the IOI circuit). Ablating one head may not reveal its importance if others compensate.
Goal: Not just "this circuit exists" but "this is how the model solves this task, and here's why this architecture makes sense."
This week's exercise provides hands-on experience with circuit discovery:
We will attempt automated circuit discovery for pun processing, synthesizing insights from previous weeks to map the computational pathway from setup to punchline recognition.
Before automated discovery, compile what you already know:
Create a hypothesis: "Pun processing likely involves [specific layers], activated by [specific positions], and uses [specific attention patterns]."
Apply gradient-based attribution to discover which attention heads matter for pun recognition:
Validate the discovered circuit with causal interventions:
Discussion: Does pun understanding have a clean, minimal circuit, or is it distributed across many components? How does this compare to more localized circuits like IOI?
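The gradient-attribution step in this exercise (as in edge attribution patching) scores each component by a first-order approximation of its patching effect: (corrupted activation − clean activation) times the gradient of the metric with respect to that activation. A toy linear sketch, with illustrative weights and activations:

```python
import numpy as np

# Toy task metric, linear in three component activations; the weights
# and activation values are illustrative, not from a real model.
w = np.array([2.0, 0.5, 0.01])

def metric(acts):
    return float(w @ acts)

clean_acts = np.array([1.0, 1.0, 1.0])
corrupt_acts = np.array([0.0, 0.0, 0.0])

grad = w                                      # d(metric)/d(acts), exact here
scores = (corrupt_acts - clean_acts) * grad   # per-component attribution
ranked = np.argsort(-np.abs(scores))          # most important first
```

On a real model the gradient comes from a single backward pass, which is what makes this approximation much cheaper than patching every component individually; the top-ranked components are then the candidates to validate causally.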
Open EAP-IG Notebook in Colab
Open Circuit Tracer Notebook in Colab

Due: Thursday of Week 8
Map the computational pathway for your concept: identify how features and components combine across layers to compute your concept. Build a mechanistic explanation of the algorithm the model uses.
A good circuit explanation should let someone understand not just WHERE the concept is computed, but HOW and WHY the model computes it that way.