Week 8: Circuits

Overview

So far, we've learned to observe representations (visualization), manipulate them (steering), and test causal importance (patching). Now we put it all together: reverse-engineering complete computational circuits. A circuit is a minimal, faithful subgraph of the network that implements a specific behavior. This week focuses on circuit discovery methodology, with the Indirect Object Identification (IOI) circuit in GPT-2 small as our main case study: how the model identifies who received an action (e.g., "Mary gave the milk to ____" → "John"). We also cover binding (attribute-entity associations), concept induction (semantic-level patterns), automated discovery (ACDC), and testing circuit faithfulness.

Learning Objectives

By the end of this week, you should be able to:

  1. Define what makes a subgraph of a network a circuit (faithful, complete, minimal)
  2. Use path patching to trace information flow between specific components
  3. Distinguish Q-, K-, and V-composition between attention heads
  4. Explain the IOI circuit and how it was discovered and validated
  5. Apply automated discovery methods (ACDC, EAP-IG) and test circuit faithfulness

Required Readings

Wang et al. (2022). Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. A circuit analysis showing how GPT-2 small performs indirect object identification.
Conmy et al. (2023). Towards Automated Circuit Discovery for Mechanistic Interpretability. ACDC: automated circuit discovery addressing the scalability challenge of manual analysis.
Hanna, Pezzelle & Belinkov (2024). Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms. EAP-IG: improved circuit discovery combining gradient-based attribution with causal intervention.

Tutorial: Reverse-Engineering Neural Circuits

1. What is a Circuit?

A circuit is a computational subgraph of a neural network that implements a specific algorithm or behavior. Think of it like reverse-engineering: given a complex machine (the full network), identify the minimal set of components that perform a particular function.

Key Properties of a Good Circuit

A good circuit is faithful (running it alone reproduces the model's behavior on the task), complete (it contains every component that participates in the behavior), and minimal (no component can be removed without degrading performance).

Circuit Discovery Workflow

  1. Identify behavior: What specific computation are we studying?
  2. Hypothesis formation: What components might be involved?
  3. Causal testing: Use patching/ablation to test necessity
  4. Composition analysis: How do components communicate?
  5. Minimality testing: Can we remove any components?
  6. Faithfulness validation: Does it work across distribution?

2. Path Patching: Tracing Information Flow

Path patching extends activation patching to trace specific paths through the network. Instead of patching an entire layer, we patch only the pathway from component A to component B.

The Setup

Consider two runs: clean and corrupted. We want to know: does information flow from component A (e.g., attention head 3.2) to component B (e.g., attention head 7.5)?

Layer 3, Head 2 → [some path] → Layer 7, Head 5

Path Patching Procedure

  1. Run clean and corrupted: Get activations for both inputs
  2. Identify source and target: A is source (layer 3, head 2), B is target (layer 7, head 5)
  3. Patch the connection: Run corrupted input, but replace A's contribution to B's input with clean A's contribution
  4. Measure effect: Does patching this specific path restore clean behavior?

Implementation Detail

In transformers, component A writes to the residual stream. Component B reads from the residual stream. To patch the path A→B:

  1. Run corrupted forward pass up to B
  2. Replace A's output (its write to residual) with clean A's output
  3. Let B read from this modified residual stream
  4. Continue forward pass normally

High path patching effect = A causally influences B
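The procedure above can be sketched on a toy linear "model". Everything here is invented for illustration: hypothetical components A and B each write a linear transform of their input into a shared residual stream (stand-ins for attention heads), and `forward` lets us override only the residual that B reads.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # residual stream width

# Hypothetical linear components A and B, plus a readout direction.
W_A = rng.normal(size=(d, d)) / np.sqrt(d)
W_B = rng.normal(size=(d, d)) / np.sqrt(d)
w_out = rng.normal(size=d)

def forward(x, b_input_override=None):
    """Output = (x + A's write + B's write) @ w_out.
    Optionally override the residual that B reads (the A->B edge)."""
    write_a = x @ W_A
    b_input = x + write_a if b_input_override is None else b_input_override
    write_b = b_input @ W_B
    return (x + write_a + write_b) @ w_out

x_clean = rng.normal(size=d)
x_corr = rng.normal(size=d)

out_clean = forward(x_clean)
out_corr = forward(x_corr)

# Path patch A -> B: run the corrupted input, but let B read a residual
# in which A's write comes from the clean run. The direct path from A to
# the output stays corrupted -- only the A->B edge is patched.
b_input_patched = x_corr + x_clean @ W_A
out_patched = forward(x_corr, b_input_override=b_input_patched)

restored = (out_patched - out_corr) / (out_clean - out_corr)
print(f"fraction of clean-corrupted gap restored via A->B: {restored:.2f}")
```

In a real transformer you would override only head A's per-head output (e.g., its `hook_z` activation in TransformerLens) as seen by B, while freezing every other path; that selectivity is what makes this a path patch rather than a full residual patch.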

3. Composition Types in Attention Circuits

Multi-head attention circuits communicate through composition: one head's output influences another head's computation. There are three main types:

Q-Composition (Query Composition)

Head A's output affects what Head B pays attention to by modifying B's query vectors.

Head A output → adds to residual → Head B reads → computes queries → attention pattern changes

Example: Binding circuits use Q-composition. An earlier head identifies an entity, and a later head uses this information to determine where to look for attributes.

K-Composition (Key Composition)

Head A's output affects what Head B attends to by modifying B's key vectors.

Head A output → adds to residual → Head B reads → computes keys → changes what gets attended

Example: In the IOI circuit, previous token heads K-compose with induction heads: the earlier head writes "what token preceded me" into the residual stream, and the induction head's keys read it, which is how the circuit detects that the subject name is repeated.

V-Composition (Value Composition)

Head A's output affects what information Head B extracts by modifying B's value vectors.

Head A output → adds to residual → Head B reads → computes values → changes what information is moved

How to distinguish: Path patch from A to B's Q/K/V inputs separately. Whichever has the highest effect tells you the composition type.
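This test can be made concrete with a toy attention head whose query, key, and value inputs are patched independently. Everything below (`W_Q`/`W_K`/`W_V`, the random residual streams) is invented for illustration; a real experiment would patch only A's write into each input, not the whole residual stream.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 8, 4  # residual width, sequence length

# Hypothetical projections for head B.
W_Q = rng.normal(size=(d, d)) / np.sqrt(d)
W_K = rng.normal(size=(d, d)) / np.sqrt(d)
W_V = rng.normal(size=(d, d)) / np.sqrt(d)

def head_b(resid_q, resid_k, resid_v):
    """Head B, with separately patchable Q/K/V input streams."""
    Q, K, V = resid_q @ W_Q, resid_k @ W_K, resid_v @ W_V
    scores = Q @ K.T / np.sqrt(d)
    attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return attn @ V

# Residual streams downstream of component A (clean vs. corrupted run).
resid_clean = rng.normal(size=(T, d))
resid_corr = rng.normal(size=(T, d))
out_corr = head_b(resid_corr, resid_corr, resid_corr)

# Patch the clean residual into exactly one of B's inputs at a time;
# the input with the largest effect indicates the composition type.
effects = {}
for name, patched in {
    "Q": head_b(resid_clean, resid_corr, resid_corr),
    "K": head_b(resid_corr, resid_clean, resid_corr),
    "V": head_b(resid_corr, resid_corr, resid_clean),
}.items():
    effects[name] = np.linalg.norm(patched - out_corr)
    print(f"{name}-patch effect: {effects[name]:.3f}")
```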

4. Case Study 1: The IOI Circuit (Wang et al.)

Indirect Object Identification (IOI) is a natural language task: given a sentence like "When Mary and John went to the store, Mary gave the milk to ____", the model must predict the indirect object—the person who received the action—i.e., "John", not "Mary" (the subject). Wang et al. (2022) reverse-engineered how GPT-2 small implements this behavior using causal interventions.

The Behavior

Input: "When Mary and John went to the store, Mary gave the milk to"
Model predicts: "John" (the indirect object—who received the milk—not "Mary", the subject)
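The standard metric for this behavior is the logit difference between the two names at the final position. A minimal sketch, where the token ids and logit values are made up for illustration:

```python
import numpy as np

def ioi_logit_diff(logits, io_token, s_token):
    """IOI metric: logit(IO) - logit(S) at the final position.
    Positive values mean the model prefers the indirect object."""
    return logits[io_token] - logits[s_token]

IO_TOKEN, S_TOKEN = 1757, 5335  # hypothetical token ids for the two names
logits = np.zeros(50257)        # GPT-2 vocabulary size
logits[IO_TOKEN], logits[S_TOKEN] = 3.2, 1.1  # fabricated logits

print(f"logit diff: {ioi_logit_diff(logits, IO_TOKEN, S_TOKEN):.2f}")
# A positive diff (here 2.10) means "John" beats "Mary".
```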

The Circuit: 26 Heads in 7 Classes

The IOI circuit in GPT-2 small comprises 26 attention heads grouped into 7 main classes. Key roles include:

Name Mover Heads

Attend from the final position back to the indirect object's name and copy it into the output logits (in GPT-2 small, chiefly heads 9.9, 9.6, and 10.0).

Duplicate Token Heads

Activate at the second occurrence of the subject (S2) and attend back to its first occurrence (S1), signaling that the name is duplicated.

S-Inhibition (Subject Inhibition)

Read the duplication signal and modify the name movers' queries so they avoid attending to the repeated subject.

Other Supporting Heads

Previous token and induction heads help detect the duplication; backup name movers and negative name movers modulate the copying.

"When Mary and John ... Mary gave the milk to"
↓ Duplicate token heads: detect repeated "Mary"
↓ S-inhibition: suppress subject "Mary"
↓ Name mover heads: attend to "John" (indirect object)
Predicts: "John"

Discovery and Validation

The circuit was discovered using causal interventions (ablation, path patching). The paper evaluates the explanation with three criteria: faithfulness (the circuit reproduces the full model's behavior), completeness (it contains every component involved in the behavior), and minimality (it contains no unnecessary components).

Testing the Circuit

Necessity (ablation): knocking out the circuit's heads (replacing their outputs with mean activations) largely destroys the model's preference for the indirect object.

Sufficiency and minimality: running the model with only the circuit intact, everything else mean-ablated, recovers most of the full model's logit difference, and removing any head class measurably degrades it.

5. Case Study 2: Binding Circuits

Binding is about associating attributes with entities: remembering that "tall" goes with person 1 and "short" goes with person 2.

The Behavior

"The tall person and the short person walked into a room.
The tall person sat down."

Question: Which person sat? → person 1 (the tall one)

The Binding Circuit

Nikhil's work identified a multi-head circuit for maintaining attribute-entity bindings:

Phase 1: Create Bindings (during first mention)

Phase 2: Retrieve Bindings (during second mention)

"The TALL person₁ and the SHORT person₂"
↓ Binding heads (layer 4-6)
Residual stores: position₁ ← tall, position₂ ← short

"The TALL person sat"
↓ Query head (layer 8) via Q-composition
Queries modified → retrieves position₁ binding
→ Resolves to person₁
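The retrieval step can be sketched with toy binding vectors. The vectors, dimensions, and the `retrieve` head below are all invented, with the two bindings made orthogonal for readability:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16

# Hypothetical binding vectors for the two entity slots.
b1 = np.zeros(d)
b1[0] = 4.0  # binds "tall" to position 1
b2 = np.zeros(d)
b2[1] = 4.0  # binds "short" to position 2
keys = np.stack([b1, b2])  # what the retrieval head's keys expose

def retrieve(query):
    """Attention over the two entity positions given a query vector."""
    scores = keys @ query / np.sqrt(d)
    return np.exp(scores) / np.exp(scores).sum()

# Q-composition: at the second mention of "tall", an earlier head adds
# entity 1's binding vector into the residual stream; the retrieval
# head's query reads it and attends back to the bound position.
base_query = rng.normal(size=d) * 0.1  # small unrelated query content
attn = retrieve(base_query + b1)
print(f"attention: position 1 = {attn[0]:.2f}, position 2 = {attn[1]:.2f}")
```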

Why Q-Composition?

The composition is through queries (Q-composition): at the second mention, an earlier head writes the entity's binding vector into the residual stream, the retrieval head's query reads it, and the modified query attends back to the position where the attribute was bound.

Comparing IOI and Binding

| Aspect | IOI Circuit | Binding Circuit |
| --- | --- | --- |
| Composition type | Mix of Q/K/V (name movers, duplicate token, S-inhibition) | Q-composition |
| What's modified | Which name is attended to (subject vs. indirect object) | Where to attend from (queries) |
| Complexity | 26 heads in 7 classes (GPT-2 small) | Multiple heads (3-5) |
| Task | Indirect object identification | Attribute-entity association |
| Information used | Duplicate names, subject vs. object roles | Binding vectors |

6. Case Study 3: Concept Induction (Feucht et al.)

Feucht et al. discovered that induction operates at two levels: the token level and the concept level.

Token-Level Mechanisms

The IOI circuit we studied relies on token-level structure: duplicate detection and name moving. Similar pattern-copying (e.g., induction) also works at the token level.

"When Mary and John went to the store, Mary gave a drink to"
→ Predicts: "John" (IOI: indirect object; or induction: token that followed first "Mary")

Concept-Level Induction

A parallel mechanism that copies semantic patterns, not just tokens.

"When Mary and John went to the store, Mary gave a drink to"
BUT if "John" is edited in the model's weights to "Jonathan":
→ Still predicts: "Jonathan" (semantic association, not token)

The Dual-Route Model

The model uses both routes simultaneously:

Route 1: Token Circuit (attention-based)

Route 2: Concept Circuit (MLP-based)

Input: "Mary ... Mary"
         ↙                      ↘
Token Route (attention)     Concept Route (MLP)
         ↓                      ↓
"John" (pattern match)      "Jonathan" (semantic)
         ↘                      ↙
          Combined prediction
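If each route writes its own contribution into the residual stream, the final logits are, to a first approximation, the sum of the two. A toy illustration with fabricated numbers:

```python
import numpy as np

vocab = ["John", "Jonathan", "Mary"]
token_route = np.array([2.5, 0.1, 0.0])    # attention route copies the literal token
concept_route = np.array([0.4, 1.8, 0.0])  # MLP route promotes the semantic associate

# The two routes' logit contributions simply add in the residual stream.
combined = token_route + concept_route
pred = vocab[int(np.argmax(combined))]
print(f"combined logits: {np.round(combined, 2)}, prediction: {pred}")
```

With these made-up numbers the token route wins; after the hypothetical weight edit described above, the concept route's contribution would dominate instead.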

When Each Route Dominates

Implications for Circuit Analysis

This reveals an important lesson: behaviors can have multiple implementations. Circuit discovery must account for parallel routes that compute the same output, redundancy between those routes, and the possibility that ablating one route is masked by the other.

7. Automated Circuit Discovery: ACDC

Manually finding circuits is tedious. Automated Circuit Discovery (ACDC) systematically searches for circuits using path patching.

The ACDC Algorithm

  1. Start with full model: All components are candidates
  2. For each edge (A→B):
    • Path patch from A to B
    • Measure causal effect
    • If effect is below threshold, remove edge
  3. Iteratively prune: Remove low-effect edges
  4. Result: Minimal subgraph with only high-effect connections
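The pruning loop above can be sketched schematically. The edges and effect scores below are made up; real ACDC sweeps edges from the output backward and measures each effect by actually running a path patch.

```python
def acdc_prune(edges, effect_of, threshold):
    """Keep only edges whose measured patching effect meets the threshold."""
    kept = {}
    for edge in edges:
        effect = effect_of(edge)  # in practice: run path patching on this edge
        if effect >= threshold:
            kept[edge] = effect
    return kept

# Hypothetical (source, target) edges between (layer, head) nodes,
# with fabricated effect scores.
effects = {
    (("L3", "H2"), ("L7", "H5")): 0.42,
    (("L1", "H0"), ("L7", "H5")): 0.03,
    (("L3", "H2"), ("L9", "H1")): 0.18,
    (("L2", "H4"), ("L9", "H1")): 0.01,
}

circuit = acdc_prune(effects, effects.get, threshold=0.1)
print(f"kept {len(circuit)} of {len(effects)} edges")
```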

Advantages

Limitations

Best Practices

8. Testing Circuit Faithfulness

A circuit is faithful if it actually explains how the full model works, not just a spurious correlation.

Faithfulness Tests

1. Ablation Test:

2. Out-of-Distribution Test:

3. Adversarial Test:

4. Intervention Test:

5. Necessity and Sufficiency:
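A common way to score these tests is a normalized faithfulness metric: the fraction of the full model's task performance, above a fully ablated baseline, that the circuit alone recovers. A sketch with made-up logit-difference numbers:

```python
def faithfulness(perf_circuit, perf_full, perf_ablated):
    """Fraction of full-model performance (above the ablated baseline)
    recovered when only the circuit is left intact."""
    return (perf_circuit - perf_ablated) / (perf_full - perf_ablated)

# Fabricated logit-difference scores for illustration.
perf_full = 3.5     # intact model
perf_circuit = 3.1  # circuit heads kept, everything else mean-ablated
perf_ablated = 0.2  # all candidate components mean-ablated

score = faithfulness(perf_circuit, perf_full, perf_ablated)
print(f"faithfulness: {score:.2f}")  # close to 1.0 means the circuit explains the behavior
```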

9. Common Pitfalls in Circuit Analysis

Pitfall 1: Confusing Correlation with Causation

Just because a component activates during a behavior doesn't mean it causes the behavior. Always use causal interventions (patching/ablation) to test.

Pitfall 2: Ignoring Polysemanticity

A single head might implement multiple circuits for different tasks. Test that your circuit is specific to your target behavior.

Pitfall 3: Dataset Overfitting

Circuits found on one dataset may not generalize. Always validate on diverse, out-of-distribution examples.

Pitfall 4: Over-Pruning

Setting the pruning threshold too high removes genuinely important components. Validate that your "minimal" circuit still works.

Pitfall 5: Missing Redundancy

Models often have redundant components (e.g., multiple name mover heads in the IOI circuit). Ablating one head may not reveal its importance if others compensate.

10. Research Workflow for Circuit Discovery

  1. Define target behavior precisely: What computation are you studying?
  2. Create minimal test cases: Simple examples that isolate the behavior
  3. Hypothesize candidate components: Based on prior work or initial observations
  4. Test necessity with ablation: Which components are required?
  5. Test composition with path patching: How do components communicate?
  6. Identify composition type: Q, K, or V composition?
  7. Map complete circuit: Draw the computational graph
  8. Test minimality: Can any component be removed?
  9. Validate faithfulness: Does it work out-of-distribution?
  10. Characterize failure modes: When and how does the circuit break?
  11. Compare with other circuits: How does this relate to known circuits?

Goal: Not just "this circuit exists" but "this is how the model solves this task, and here's why this architecture makes sense."

Code Exercise

This week's exercise provides hands-on experience with circuit discovery:

Open EAP-IG Circuit Discovery in Colab
Open Anthropic Circuit Tracer in Colab

In-Class Exercise: Discovering Pun Circuits

We will attempt automated circuit discovery for pun processing, synthesizing insights from previous weeks to map the computational pathway from setup to punchline recognition.

Part 1: Synthesize Previous Findings (15 min)

Before automated discovery, compile what you already know:

  1. From Week 2 (Neuronpedia): Which SAE features related to humor did you find?
  2. From Week 4 (Geometry): Which layers showed best pun/non-pun separation?
  3. From Week 5 (Causal tracing): Which layers × positions were causally important?
  4. From Week 6 (Probes): Where did pun probes perform best?
  5. From Week 7 (Attribution): Which input tokens had highest attribution?

Create a hypothesis: "Pun processing likely involves [specific layers], activated by [specific positions], and uses [specific attention patterns]."

Part 2: EAP-IG Circuit Discovery (25 min)

Apply gradient-based attribution to discover which attention heads matter for pun recognition:

  1. Define the task:
    • Use your trained pun probe as the target (or logit difference for pun-related tokens)
    • Run EAP-IG to compute importance scores for each attention head
  2. Analyze head importance:
    • Which heads have highest importance for pun recognition?
    • Are they in the layers you predicted?
    • Create a visualization: head importance across layers
  3. Compare to your hypothesis:
    • Do automatically discovered components match your manual findings?
    • Any surprises (important heads you did not predict)?

Part 3: Path Patching Validation (20 min)

Validate the discovered circuit with causal interventions:

  1. Identify candidate circuit:
    • Take top-5 most important heads from EAP-IG
    • These form your candidate "pun circuit"
  2. Path patching test:
    • Patch activations from pun → non-pun at these heads only
    • Does this transfer pun recognition?
    • Compute: what percentage of total effect is captured by this circuit?
  3. Ablation test:
    • Ablate the candidate circuit heads
    • Does pun probe accuracy drop?
    • How much of the pun effect is eliminated?

Discussion: Does pun understanding have a clean, minimal circuit, or is it distributed across many components? How does this compare to more localized circuits like IOI?

Open EAP-IG Notebook in Colab
Open Circuit Tracer Notebook in Colab

Project Milestone

Due: Thursday of Week 8

Map the computational pathway for your concept: identify how features and components combine across layers to compute your concept. Build a mechanistic explanation of the algorithm the model uses.

Circuit Mapping

Deliverables:

A good circuit explanation should let someone understand not just WHERE the concept is computed, but HOW and WHY the model computes it that way.