
Week 8: Circuits

Overview

So far, we've learned to observe representations (visualization), manipulate them (steering), and test causal importance (patching). Now we put it all together: reverse-engineering complete computational circuits. A circuit is a minimal, faithful subgraph of the network that implements a specific behavior. This week focuses on circuit discovery methodology using three fundamental case studies: induction (pattern-copying), binding (attribute-entity associations), and concept induction (semantic-level patterns).

Learning Objectives

By the end of this week, you should be able to:

  • Define what a circuit is and the properties (faithfulness, minimality) that make a circuit explanation good
  • Use path patching to trace information flow between specific model components
  • Distinguish Q-, K-, and V-composition between attention heads
  • Describe the induction and binding circuits and explain how they differ
  • Apply automated circuit discovery methods (ACDC, EAP-IG) and validate a circuit's faithfulness

Required Readings

Olsson et al. (2022). The canonical circuit analysis showing how induction heads emerge and drive in-context learning.
Conmy et al. (2023). ACDC: Automated circuit discovery addressing the scalability challenge of manual analysis.
Hanna, Pezzelle & Belinkov (2024). EAP-IG: Improved circuit discovery combining gradient-based attribution with causal intervention.

Tutorial: Reverse-Engineering Neural Circuits

1. What is a Circuit?

A circuit is a computational subgraph of a neural network that implements a specific algorithm or behavior. Think of it like reverse-engineering: given a complex machine (the full network), identify the minimal set of components that perform a particular function.

Key Properties of a Good Circuit

  • Faithful: the circuit reproduces the full model's behavior on the task, not just a correlated signal
  • Minimal: every component is necessary; nothing can be removed without breaking the behavior
  • Specific: it explains the target behavior rather than an unrelated computation that happens to share components

Circuit Discovery Workflow

  1. Identify behavior: What specific computation are we studying?
  2. Hypothesis formation: What components might be involved?
  3. Causal testing: Use patching/ablation to test necessity
  4. Composition analysis: How do components communicate?
  5. Minimality testing: Can we remove any components?
  6. Faithfulness validation: Does it work across distribution?

2. Path Patching: Tracing Information Flow

Path patching extends activation patching to trace specific paths through the network. Instead of patching an entire layer, we patch only the pathway from component A to component B.

The Setup

Consider two runs: clean and corrupted. We want to know: does information flow from component A (e.g., attention head 3.2) to component B (e.g., attention head 7.5)?

Layer 3, Head 2 → [some path] → Layer 7, Head 5

Path Patching Procedure

  1. Run clean and corrupted: Get activations for both inputs
  2. Identify source and target: A is source (layer 3, head 2), B is target (layer 7, head 5)
  3. Patch the connection: Run corrupted input, but replace A's contribution to B's input with clean A's contribution
  4. Measure effect: Does patching this specific path restore clean behavior?

Implementation Detail

In transformers, component A writes to the residual stream. Component B reads from the residual stream. To patch the path A→B:

  1. Run corrupted forward pass up to B
  2. Replace A's output (its write to residual) with clean A's output
  3. Let B read from this modified residual stream
  4. Continue forward pass normally

High path patching effect = A causally influences B
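
A minimal sketch of this recipe in TransformerLens; the model, prompts, and (layer, head) indices below are placeholders. Note that this simplified version overwrites A's entire downstream contribution, so it measures A's total effect; full path patching would additionally freeze the other downstream components so that only the A→B edge changes.

```python
# Minimal sketch of the simplified recipe above, using TransformerLens.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

# Clean and corrupted prompts should tokenize to the same length.
clean_tokens = model.to_tokens("When Mary and John went to the store, Mary gave a drink to")
corrupt_tokens = model.to_tokens("When Mary and John went to the store, Sarah gave a drink to")

SRC_LAYER, SRC_HEAD = 3, 2   # component A (layer 3, head 2)

# 1. Cache activations from the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

# 2. During the corrupted run, overwrite head A's output (its write to the
#    residual stream) with the clean value. Everything downstream, including B,
#    now reads the patched residual stream.
def patch_head_output(z, hook):
    # z has shape [batch, pos, head, d_head]
    z[:, :, SRC_HEAD, :] = clean_cache[hook.name][:, :, SRC_HEAD, :]
    return z

patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(utils.get_act_name("z", SRC_LAYER), patch_head_output)],
)

# 3. Measure how much of the clean behavior is restored at the final position.
answer = model.to_single_token(" John")
clean_logit = model(clean_tokens)[0, -1, answer].item()
corrupt_logit = model(corrupt_tokens)[0, -1, answer].item()
patched_logit = patched_logits[0, -1, answer].item()
print(f"clean {clean_logit:.2f}  corrupted {corrupt_logit:.2f}  patched {patched_logit:.2f}")
```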

3. Composition Types in Attention Circuits

Multi-head attention circuits communicate through composition: one head's output influences another head's computation. There are three main types:

Q-Composition (Query Composition)

Head A's output affects what Head B pays attention to by modifying B's query vectors.

Head A output → adds to residual → Head B reads → computes queries → attention pattern changes

Example: Binding circuits use Q-composition. An earlier head identifies an entity, and a later head uses this information to determine where to look for attributes.

K-Composition (Key Composition)

Head A's output affects what Head B attends to by modifying B's key vectors.

Head A output → adds to residual → Head B reads → computes keys → changes what gets attended

Example: Induction circuits use K-composition. The previous-token head modifies key representations so the induction head can find the right token to copy from.

V-Composition (Value Composition)

Head A's output affects what information Head B extracts by modifying B's value vectors.

Head A output → adds to residual → Head B reads → computes values → changes what information is moved

How to distinguish: Path patch from A to B's Q/K/V inputs separately. Whichever has the highest effect tells you the composition type.
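
A sketch of one way to run this comparison, continuing from the path-patching snippet above (it reuses model, clean_cache, corrupt_tokens, and answer; the layer and head are placeholders). Patching B's own q/k/v activations with their clean values is a rough, node-level proxy for patching the A→B edge into each input separately.

```python
# Sketch: which of head B's inputs (Q, K, or V) carries the causal signal?
from transformer_lens import utils

DST_LAYER, DST_HEAD = 7, 5   # component B (layer 7, head 5)

baseline = model(corrupt_tokens)[0, -1, answer]

effects = {}
for channel in ["q", "k", "v"]:
    name = utils.get_act_name(channel, DST_LAYER)

    def patch(act, hook):
        # act has shape [batch, pos, head, d_head]
        act[:, :, DST_HEAD, :] = clean_cache[hook.name][:, :, DST_HEAD, :]
        return act

    logits = model.run_with_hooks(corrupt_tokens, fwd_hooks=[(name, patch)])
    # How far does patching this input move the answer logit back toward clean?
    effects[channel] = (logits[0, -1, answer] - baseline).item()

print(effects)  # the largest effect suggests Q-, K-, or V-composition respectively
```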

4. Case Study 1: The Induction Circuit

Induction is the simplest and most fundamental circuit: when a pattern from earlier in the context starts to repeat, copy the token that completed it last time.

The Behavior

Input: "When Mary and John went to the store, Mary gave a drink to"
Model predicts: "John" (completing the pattern established earlier in the context)

The Two-Head Circuit

The induction circuit requires two attention heads working together:

Head 1: Previous-Token Head (early layer)

  • Attends from each position to the token immediately before it and writes that token's identity into the current position's residual stream.

Head 2: Induction Head (later layer)

  • Uses the shifted information written by Head 1 to find where the current token occurred earlier, attends to the position just after that occurrence, and copies the token it finds there.

"Mary and John ... Mary"
↓ Previous-token head (layer 2)
Position "Mary₂" now carries information about the token that preceded it
↓ Induction head (layer 6) via K-composition
Keys modified → the query from "Mary₂" matches the earlier "Mary₁ and John" pattern
↓ Attends to "John"
Predicts: "John"
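
Before dissecting the circuit, induction heads can be located empirically: on a sequence of random tokens repeated twice, an induction head attends from each token in the second copy back to the position just after that token's first occurrence. A quick sketch, assuming TransformerLens and GPT-2 (the sequence length and token range are arbitrary):

```python
# Score every attention head by how strongly it shows the induction pattern
# (attending to the "one-after-the-previous-occurrence" position) on repeated
# random tokens.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

seq_len, batch = 50, 8
rand = torch.randint(1000, 20000, (batch, seq_len))
tokens = torch.cat([rand, rand], dim=1)   # the same random sequence, twice

_, cache = model.run_with_cache(tokens)

scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]     # [batch, head, dest_pos, src_pos]
    # Induction target for destination position i is source position i - (seq_len - 1).
    stripe = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    scores[layer] = stripe.mean(dim=(0, -1))

top = torch.topk(scores.flatten(), k=5).indices
print([(int(i) // model.cfg.n_heads, int(i) % model.cfg.n_heads) for i in top])
```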

Why K-Composition?

The composition is through keys (K-composition):

  • The previous-token head writes each token's identity into the next position's residual stream.
  • The induction head's keys read this shifted information, so each position's key effectively encodes "the token before me."
  • The induction head's query carries the current token, so attention lands on positions whose preceding token matches, which is exactly where the pattern continued last time.

Testing the Circuit

Necessity (ablation):

Sufficiency (path patching):

Minimality:

5. Case Study 2: Binding Circuits

Binding is about associating attributes with entities: remembering that "tall" goes with person 1 and "short" goes with person 2.

The Behavior

"The tall person and the short person walked into a room.
The tall person sat down."

Question: Which person sat? → person 1 (the tall one)

The Binding Circuit

Nikhil's work identified a multi-head circuit for maintaining attribute-entity bindings:

Phase 1: Create Bindings (during first mention)

Phase 2: Retrieve Bindings (during second mention)

"The TALL person₁ and the SHORT person₂"
↓ Binding heads (layer 4-6)
Residual stores: position₁ ← tall, position₂ ← short

"The TALL person sat"
↓ Query head (layer 8) via Q-composition
Queries modified → retrieves position₁ binding
→ Resolves to person₁
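
One concrete way to probe the retrieval step, under the assumption that a candidate query head has already been identified (the head index below is a hypothetical placeholder; the layer follows the diagram above): check whether, at the second mention of "tall", that head attends back to the first entity's position rather than the second's.

```python
# Sketch: does a candidate query head attend to the correct entity at retrieval time?
# Assumes TransformerLens; the head index and prompt are illustrative.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = ("The tall person and the short person walked into a room. "
          "The tall person sat down.")
tokens = model.to_tokens(prompt)
str_tokens = model.to_str_tokens(prompt)

QUERY_LAYER, QUERY_HEAD = 8, 3          # hypothetical retrieval head

_, cache = model.run_with_cache(tokens)
pattern = cache["pattern", QUERY_LAYER][0, QUERY_HEAD]   # [dest_pos, src_pos]

# Positions of the two entity mentions and of the second "tall" (the retrieval cue).
person_positions = [i for i, t in enumerate(str_tokens) if t == " person"]
second_tall = [i for i, t in enumerate(str_tokens) if t == " tall"][1]

attn_to_person1 = pattern[second_tall, person_positions[0]].item()
attn_to_person2 = pattern[second_tall, person_positions[1]].item()
print(f"attention to person_1: {attn_to_person1:.3f}, person_2: {attn_to_person2:.3f}")
```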

Why Q-Composition?

The composition is through queries (Q-composition):

  • During the first mention, binding heads write attribute information into the residual stream at each entity's position.
  • At the second mention, the repeated attribute ("tall") modifies the query head's queries, telling it where to look for the matching binding.
  • Because the change is on the query side, what shifts is where the head attends from, not the keys being attended to.

Comparing Induction and Binding

| Aspect | Induction Circuit | Binding Circuit |
| --- | --- | --- |
| Composition Type | K-composition | Q-composition |
| What's Modified | What gets attended to (keys) | Where to attend from (queries) |
| Complexity | 2 heads minimum | Multiple heads (3-5) |
| Task | Pattern copying | Attribute-entity association |
| Information Stored | Token identity (shifted) | Binding vectors |

6. Case Study 3: Concept Induction (Feucht et al.)

The Feucht paper discovered that induction operates at two levels: token-level and concept-level.

Token-Level Induction

The standard induction circuit we just studied: copy exact tokens.

"When Mary and John went to the store, Mary gave a drink to"
→ Predicts: "John" (exact token match)

Concept-Level Induction

A parallel mechanism that copies semantic patterns, not just tokens.

"When Mary and John went to the store, Mary gave a drink to"
BUT if "John" is edited in the model's weights to "Jonathan":
→ Still predicts: "Jonathan" (semantic association, not token)

The Dual-Route Model

The model uses both routes simultaneously:

Route 1: Token Circuit (attention-based)

Route 2: Concept Circuit (MLP-based)

              Input: "Mary ... Mary"
               ↙                  ↘
Token Route (attention)    Concept Route (MLP)
          ↓                          ↓
"John" (pattern match)     "Jonathan" (semantic)
               ↘                  ↙
              Combined prediction

When Each Route Dominates

Implications for Circuit Analysis

This reveals an important lesson: behaviors can have multiple implementations. Circuit discovery must account for redundant routes that run in parallel, for mechanisms at different levels of abstraction (token vs. concept), and for different component types (attention vs. MLP).

7. Automated Circuit Discovery: ACDC

Manually finding circuits is tedious. Automated Circuit Discovery (ACDC) systematically searches for circuits using path patching.

The ACDC Algorithm

  1. Start with full model: All components are candidates
  2. For each edge (A→B):
    • Path patch from A to B
    • Measure causal effect
    • If effect is below threshold, remove edge
  3. Iteratively prune: Remove low-effect edges
  4. Result: Minimal subgraph with only high-effect connections (a toy version of this pruning loop is sketched below)
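
A toy sketch of the pruning loop itself. The causal measurement is abstracted behind a stand-in function (in practice, the path-patching effect from section 2), the threshold is arbitrary, and real ACDC traverses edges in order from the output rather than in one sweep:

```python
# Toy ACDC-style pruning: keep only edges whose measured effect exceeds a threshold.
from itertools import product

N_LAYERS, N_HEADS = 12, 12
THRESHOLD = 0.05   # arbitrary; in practice swept and validated

def path_patch_effect(src, dst):
    """Hypothetical stand-in: should return the change in the task metric when
    the src -> dst path is replaced by its corrupted value (see section 2)."""
    return 0.0  # placeholder

heads = list(product(range(N_LAYERS), range(N_HEADS)))
# Candidate edges: each head can write to any head in a strictly later layer.
edges = [(a, b) for a, b in product(heads, heads) if a[0] < b[0]]

circuit = [(src, dst) for src, dst in edges
           if abs(path_patch_effect(src, dst)) >= THRESHOLD]

print(f"kept {len(circuit)} of {len(edges)} candidate edges")
```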

Advantages

Limitations

Best Practices

8. Testing Circuit Faithfulness

A circuit is faithful if it actually explains how the full model works, not just a spurious correlation.
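
One way to quantify this (an assumed convention here, not a definition from the readings): ablate every attention head outside the candidate circuit and check how much of the model's behavior on the task survives. The sketch below reuses model, clean_tokens, and answer from the path-patching snippet; the circuit head list is hypothetical, and zero-ablation is used for brevity where mean ablation over a reference dataset is usually preferred.

```python
# Sketch: run the model with everything outside the candidate circuit ablated
# and compare the task metric to the full model.
from transformer_lens import utils

circuit_heads = {(2, 3), (6, 9)}   # hypothetical (layer, head) pairs

def ablate_non_circuit(z, hook):
    layer = hook.layer()
    for head in range(z.shape[2]):
        if (layer, head) not in circuit_heads:
            z[:, :, head, :] = 0.0   # zero-ablation for brevity; prefer mean ablation
    return z

hooks = [(utils.get_act_name("z", l), ablate_non_circuit) for l in range(model.cfg.n_layers)]

full_logit = model(clean_tokens)[0, -1, answer]
circuit_logit = model.run_with_hooks(clean_tokens, fwd_hooks=hooks)[0, -1, answer]

# In practice, compare a logit difference or accuracy rather than a raw logit.
print(f"answer logit: full {full_logit.item():.2f}, circuit-only {circuit_logit.item():.2f}")
```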

Faithfulness Tests

1. Ablation Test:

2. Out-of-Distribution Test:

3. Adversarial Test:

4. Intervention Test:

5. Necessity and Sufficiency:

9. Common Pitfalls in Circuit Analysis

Pitfall 1: Confusing Correlation with Causation

Just because a component activates during a behavior doesn't mean it causes the behavior. Always use causal interventions (patching/ablation) to test.

Pitfall 2: Ignoring Polysemanticity

A single head might implement multiple circuits for different tasks. Test that your circuit is specific to your target behavior.

Pitfall 3: Dataset Overfitting

Circuits found on one dataset may not generalize. Always validate on diverse, out-of-distribution examples.

Pitfall 4: Over-Pruning

Setting the pruning threshold too high removes genuinely important components. Validate that your "minimal" circuit still works.

Pitfall 5: Missing Redundancy

Models often have redundant circuits (like token-level and concept-level induction). Ablating one circuit may show little effect if another compensates for it.

10. Research Workflow for Circuit Discovery

  1. Define target behavior precisely: What computation are you studying?
  2. Create minimal test cases: Simple examples that isolate the behavior
  3. Hypothesize candidate components: Based on prior work or initial observations
  4. Test necessity with ablation: Which components are required?
  5. Test composition with path patching: How do components communicate?
  6. Identify composition type: Q, K, or V composition?
  7. Map complete circuit: Draw the computational graph
  8. Test minimality: Can any component be removed?
  9. Validate faithfulness: Does it work out-of-distribution?
  10. Characterize failure modes: When and how does the circuit break?
  11. Compare with other circuits: How does this relate to known circuits?

Goal: Not just "this circuit exists" but "this is how the model solves this task, and here's why this architecture makes sense."

Code Exercise

This week's exercise provides hands-on experience with circuit discovery:

Open EAP-IG Circuit Discovery in Colab
Open Anthropic Circuit Tracer in Colab

In-Class Exercise: Discovering Pun Circuits

We will attempt automated circuit discovery for pun processing, synthesizing insights from previous weeks to map the computational pathway from setup to punchline recognition.

Part 1: Synthesize Previous Findings (15 min)

Before automated discovery, compile what you already know:

  1. From Week 2 (Neuronpedia): Which SAE features related to humor did you find?
  2. From Week 4 (Geometry): Which layers showed best pun/non-pun separation?
  3. From Week 5 (Causal tracing): Which layers × positions were causally important?
  4. From Week 6 (Probes): Where did pun probes perform best?
  5. From Week 7 (Attribution): Which input tokens had highest attribution?

Create a hypothesis: "Pun processing likely involves [specific layers], activated by [specific positions], and uses [specific attention patterns]."

Part 2: EAP-IG Circuit Discovery (25 min)

Apply gradient-based attribution to discover which attention heads matter for pun recognition:

  1. Define the task:
    • Use your trained pun probe as the target (or logit difference for pun-related tokens)
    • Run EAP-IG to compute importance scores for each attention head (a simplified stand-in sketch appears after this list)
  2. Analyze head importance:
    • Which heads have highest importance for pun recognition?
    • Are they in the layers you predicted?
    • Create a visualization: head importance across layers
  3. Compare to your hypothesis:
    • Do automatically discovered components match your manual findings?
    • Any surprises (important heads you did not predict)?
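
If the EAP-IG tooling in the notebook gives you trouble, a rough stand-in you can run directly is plain attribution patching: score each head by (clean − corrupted activation) · gradient of the metric. This drops the integrated-gradients refinement that EAP-IG adds. The prompts and metric below are placeholders for your pun data and probe, and the two prompts should tokenize to the same length.

```python
# Rough stand-in for EAP-IG: attribution-patching scores for every attention head.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean = model.to_tokens("placeholder pun sentence")        # replace with a real pun
corrupt = model.to_tokens("placeholder literal sentence")  # matched non-pun control

def metric(logits):
    # Placeholder: swap in your pun-probe score or a pun-related logit difference.
    return logits[0, -1, :].max()

z_names = [utils.get_act_name("z", l) for l in range(model.cfg.n_layers)]

_, clean_cache = model.run_with_cache(clean)

corrupt_acts, grads = {}, {}
def save_act(act, hook): corrupt_acts[hook.name] = act.detach()
def save_grad(grad, hook): grads[hook.name] = grad.detach()

# Run the corrupted input, saving activations and the metric's gradients at each head.
with model.hooks(fwd_hooks=[(n, save_act) for n in z_names],
                 bwd_hooks=[(n, save_grad) for n in z_names]):
    metric(model(corrupt)).backward()

# Per-head importance: (clean - corrupted activation) dotted with the gradient.
scores = torch.stack([
    ((clean_cache[n] - corrupt_acts[n]) * grads[n]).sum(dim=(0, 1, 3))  # -> [head]
    for n in z_names
])                                                                       # [layer, head]
print(torch.topk(scores.abs().flatten(), k=5))
```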

Part 3: Path Patching Validation (20 min)

Validate the discovered circuit with causal interventions:

  1. Identify candidate circuit:
    • Take top-5 most important heads from EAP-IG
    • These form your candidate "pun circuit"
  2. Path patching test:
    • Patch activations from pun → non-pun at these heads only
    • Does this transfer pun recognition?
    • Compute: what percentage of total effect is captured by this circuit? (see the sketch after this list)
  3. Ablation test:
    • Ablate the candidate circuit heads
    • Does pun probe accuracy drop?
    • How much of the pun effect is eliminated?
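
A compact sketch of the "percentage of total effect captured" computation, reusing model and metric from the previous sketch. The head list and prompts are placeholders; "total effect" here means the effect of patching all attention heads, which is itself an approximation.

```python
# Fraction of the total patching effect captured by the candidate circuit:
# patch pun -> non-pun activations at the circuit heads only, then at all heads.
from transformer_lens import utils

circuit_heads = [(5, 1), (7, 4), (9, 6), (9, 11), (10, 0)]   # hypothetical top-5 heads
all_heads = [(l, h) for l in range(model.cfg.n_layers) for h in range(model.cfg.n_heads)]

# Pun and control prompts should tokenize to the same length.
pun_tokens = model.to_tokens("placeholder pun sentence")
control_tokens = model.to_tokens("placeholder literal sentence")
_, pun_cache = model.run_with_cache(pun_tokens)

def patch_heads(heads):
    """Hooks that overwrite the listed heads' outputs with their pun-run values."""
    def patch_hook(z, hook):
        for layer, head in heads:
            if hook.layer() == layer:
                z[:, :, head, :] = pun_cache[hook.name][:, :, head, :]
        return z
    return [(utils.get_act_name("z", l), patch_hook) for l in range(model.cfg.n_layers)]

base = metric(model(control_tokens))
circuit_effect = metric(model.run_with_hooks(control_tokens, fwd_hooks=patch_heads(circuit_heads))) - base
total_effect = metric(model.run_with_hooks(control_tokens, fwd_hooks=patch_heads(all_heads))) - base

print(f"fraction of effect captured: {(circuit_effect / total_effect).item():.2f}")
```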

Discussion: Does pun understanding have a clean, minimal circuit, or is it distributed across many components? How does this compare to simpler circuits like induction?

Open EAP-IG Notebook in Colab
Open Circuit Tracer Notebook in Colab

Project Milestone

Due: Thursday of Week 8

Map the computational pathway for your concept: identify how features and components combine across layers to compute your concept. Build a mechanistic explanation of the algorithm the model uses.

Circuit Mapping

Deliverables:

A good circuit explanation should let someone understand not just WHERE the concept is computed, but HOW and WHY the model computes it that way.