Learning Objectives
By the end of this week, you should be able to:
- Understand how neural network mechanisms emerge during training
- Study phase transitions and grokking phenomena
- Track circuit formation across training checkpoints
- Apply interpretability methods to training trajectories
- Identify when your concept's representation emerges
- Understand why some mechanisms form suddenly vs gradually
- Connect training dynamics to final model behavior
Required Readings
Supplementary Readings
1. Why Study Training Dynamics?
The Formation Question
In Weeks 1-8 you studied your concept in a fully trained model. But critical questions remain:
- When does your concept's circuit form during training?
- Does it emerge suddenly (phase transition) or gradually?
- Which components appear first? Which come later?
- What training dynamics (loss curves, gradient patterns) coincide with emergence?
- Can we predict when a capability will emerge from training dynamics?
Why This Matters for Research:
Understanding how mechanisms form helps explain why they exist. Training dynamics reveal:
- What inductive biases lead models to learn your concept
- Whether your concept is a "natural" solution or coincidental
- How robust the mechanism is (sudden vs gradual = different stability)
- What minimal training is needed for the concept to appear
Use Cases for Your Project
Example: Induction Heads
Olsson et al. (2022) found induction heads emerge in a sharp phase transition around 2B tokens. Before: random attention. After: systematic pattern copying. This sudden emergence suggests induction is a "natural" circuit that SGD discovers reliably.
2. Grokking: Delayed Generalization
The Grokking Phenomenon
Grokking (Power et al., 2022) is when a model:
- Overfits training data quickly (100% train accuracy)
- Remains at chance on validation for many epochs
- Suddenly achieves perfect generalization after extended training
Key Observation: During the long plateau before grokking, the model has already memorized the training data but is still slowly learning to generalize. Something internal changes even though training loss is flat.
Mechanistic Explanation
Nanda et al. (2023) "Progress measures for grokking via mechanistic interpretability" showed:
- Two competing circuits: A "memorization circuit" (fast to learn, doesn't generalize) and a "generalizing circuit" (slow to learn, works on all data)
- Circuit competition: Initially, memorization circuit dominates. Over time, weight decay and other regularization favor the generalizing circuit
- Phase transition: When generalizing circuit becomes stronger, validation accuracy suddenly jumps
Experiment: Modular Addition
Task: Predict (a + b) mod 113
Memorization circuit: Lookup table in early layers
Generalizing circuit: Fourier features that exploit group structure
By tracking the strength of each circuit (via ablation) across checkpoints, Nanda et al. could anticipate grokking before it showed up in validation metrics.
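One concrete progress measure of this kind, sketched below under strong assumptions: given the learned embedding rows for the number tokens 0..112 (a placeholder tensor `W_E`), project them onto the discrete Fourier basis and measure how concentrated the variance is on a few frequencies. A memorizing model spreads energy broadly; a model converging on the Fourier algorithm concentrates it, so this number should rise ahead of the validation jump. This is an illustrative proxy, not Nanda et al.'s exact metric.

```python
import torch

def fourier_concentration(W_E: torch.Tensor, p: int = 113, top_k: int = 6) -> float:
    """Fraction of embedding variance captured by the top-k Fourier frequencies.

    W_E: [p, d_model] embedding rows for the number tokens 0..p-1 (placeholder name).
    A memorizing model spreads energy across many frequencies; a model that has
    learned the Fourier ("clock") algorithm concentrates it on a handful.
    """
    x = torch.arange(p).float()
    basis = [torch.ones(p)]                                # constant component
    for k in range(1, p // 2 + 1):                         # frequencies 1..(p-1)/2
        basis.append(torch.cos(2 * torch.pi * k * x / p))
        basis.append(torch.sin(2 * torch.pi * k * x / p))
    F = torch.stack(basis)
    F = F / F.norm(dim=-1, keepdim=True)                   # orthonormal rows
    power = (F @ W_E).pow(2).sum(dim=-1)                   # energy per basis vector
    freq_power = power[1:].reshape(-1, 2).sum(dim=-1)      # combine cos/sin pairs
    return (freq_power.topk(top_k).values.sum() / freq_power.sum()).item()

# Hypothetical usage across checkpoints (the loader is project-specific):
# for step, model in iter_checkpoints():
#     print(step, fourier_concentration(model.embed.W_E[:113]))
```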
3. Phase Transitions in Learning
Sudden vs Gradual Emergence
Not all mechanisms form gradually. Some appear in sharp phase transitions:
| Mechanism Type | Emergence Pattern | Example |
| --- | --- | --- |
| Simple features | Gradual | Edge detectors in CNNs |
| Composed circuits | Phase transition | Induction heads (Olsson et al.) |
| Algorithm discovery | Phase transition (grokking) | Modular arithmetic (Nanda et al.) |
The Induction Head Phase Transition
Case Study: Olsson et al. (2022) "In-context Learning and Induction Heads"
Finding: Induction heads (the circuit enabling in-context learning) form suddenly during training
Timeline:
- Before 2B tokens: No induction behavior, random attention patterns
- Around 2B tokens: Sharp increase in induction score over ~500M tokens
- After 2.5B tokens: Fully formed induction heads, strong ICL performance
What emerges:
- Previous token heads: Attend to previous occurrence of current token
- Induction heads: Attend to token after the previous occurrence
- Composition: The previous token heads' output feeds the induction heads' QK circuit (K-composition), wiring the two head types together
Key Insight: ICL ability improves in lockstep with induction head formation, suggesting induction heads are the mechanism for ICL, at least in small models.
4. Methods: Tracking Mechanism Formation
Checkpoint-Based Analysis
To study training dynamics, you need to apply your interpretability methods to models at different training stages:
- Save checkpoints: Every N steps (e.g., every 1000 steps, or logarithmically spaced)
- Apply methods: Run your Week 3-8 analyses on each checkpoint (a loading-loop sketch follows this list):
  - Probing accuracy over time
  - Circuit component strengths (ablation effects)
  - Attention pattern analysis
  - SAE feature activation patterns
- Plot trajectories: How do these metrics evolve?
- Identify transitions: When do sudden changes occur?
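A minimal sketch of this checkpoint loop, using the Pythia suite, whose checkpoints are published as Hugging Face repo revisions named `step<N>`; `concept_metric` is a placeholder for whichever Week 3-8 analysis you are tracking:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-160m"
STEPS = [1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000]  # roughly log-spaced

tokenizer = AutoTokenizer.from_pretrained(MODEL)

def concept_metric(model) -> float:
    """Placeholder: swap in your probe accuracy, ablation effect, attention score, etc."""
    ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids
    with torch.no_grad():
        probs = model(ids).logits[0, -1].softmax(-1)
    return probs[tokenizer(" mat").input_ids[0]].item()  # toy metric: P(" mat")

trajectory = {}
for step in STEPS:
    # Each Pythia checkpoint is stored as a repo revision named "step<N>"
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=f"step{step}")
    model.eval()
    trajectory[step] = concept_metric(model)
    del model  # release memory before loading the next checkpoint

print(trajectory)
```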
Key Metrics to Track
| Metric | What It Measures | Interpretation |
| --- | --- | --- |
| Probe accuracy | Linear readability of concept | When does concept become linearly accessible? |
| Ablation effect | Causal importance of component | When does component become necessary? |
| Attention patterns | Information routing | When do specific attention heads form? |
| Loss on concept-specific data | Task-specific performance | When does model learn the concept? |
| Logit lens evolution | Prediction development across layers | How does processing change over training? |
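For the probe-accuracy metric in particular, the per-checkpoint computation might look like the sketch below. It assumes a project-specific helper `get_activations_and_labels` (not defined here) that returns residual-stream activations and binary concept labels for your dataset at a given layer:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy_at_checkpoint(model, layer: int) -> float:
    """Train a linear probe on concept labels; return held-out accuracy."""
    # X: [n_examples, d_model] activations at `layer`; y: [n_examples] 0/1 labels.
    # get_activations_and_labels is an assumed project-specific helper.
    X, y = get_activations_and_labels(model, layer)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# accuracies = {step: probe_accuracy_at_checkpoint(load_checkpoint(step), layer=6)
#               for step in STEPS}
```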
Detecting Phase Transitions
How to identify a phase transition:
- Sharp gradient: Metric changes rapidly over short training period
- Temporal correlation: Multiple related metrics change simultaneously
- Bifurcation point: Before = one behavior, after = qualitatively different
- Reproducibility: Transition occurs at similar point across random seeds
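One way to operationalize the "sharp gradient" criterion above, as a rough sketch: take the discrete derivative of a tracked metric with respect to log training step and flag the interval where it peaks (the jump threshold is arbitrary and should be tuned to your metric).

```python
import numpy as np

def find_transition(steps, metric, min_jump=0.2):
    """Return the step with the steepest change per unit log-step, or None."""
    steps, metric = np.asarray(steps, float), np.asarray(metric, float)
    slopes = np.diff(metric) / np.diff(np.log(steps))   # change per log-step interval
    i = int(np.argmax(np.abs(slopes)))
    if abs(metric[i + 1] - metric[i]) < min_jump:        # too small to call a transition
        return None
    return int(steps[i + 1])

# Toy example: probe accuracy flat near chance, then jumping between 8k and 16k steps
steps = [1000, 2000, 4000, 8000, 16000, 32000]
acc   = [0.52, 0.55, 0.54, 0.58, 0.93, 0.95]
print(find_transition(steps, acc))  # -> 16000
```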
5. Circuit Formation Patterns
Hierarchical Assembly
Complex circuits often form in stages:
- Foundation components form first (e.g., attention heads that move information)
- Processing components form next (e.g., MLPs that transform representations)
- Composition emerges last (components wire together into full circuit)
Example: Induction Circuit Assembly
- Step 1 (~1.5B tokens): Previous token heads form
- Step 2 (~2B tokens): Induction heads form
- Step 3 (~2.2B tokens): Q-K composition circuit connects them
Each stage builds on the previous. Without previous token heads, induction heads can't function.
Redundancy and Pruning
Training dynamics often show:
- Overproduction: Many components initially contribute to a task
- Consolidation: Over training, contribution concentrates in fewer components
- Specialization: Components become more selective/focused
6. Practical Considerations
Computational Costs
Challenge: Full training runs are expensive. Strategies to reduce cost:
- Use small models: GPT-2 Small, Pythia-160M for initial studies
- Fine-tune instead of pretraining: Study how concept emerges during task-specific fine-tuning
- Sparse checkpoints: Log-spaced checkpoints capture transitions with fewer saves (a small helper sketch follows this list)
- Targeted metrics: Don't run all analyses on all checkpoints—focus on key transitions
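The log-spaced schedule mentioned above can be generated with a small helper along these lines (endpoints and count are up to you):

```python
import numpy as np

def log_spaced_steps(first: int, last: int, n: int) -> list[int]:
    """Roughly n log-spaced, deduplicated checkpoint steps between first and last."""
    return np.unique(np.round(np.geomspace(first, last, n)).astype(int)).tolist()

print(log_spaced_steps(100, 100_000, 10))
# -> [100, 215, 464, 1000, 2154, 4642, 10000, 21544, 46416, 100000]
```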
Experimental Design
For your project:
- Option A: Fine-tuning study (a checkpoint-saving sketch follows this list)
  - Start with pretrained model (e.g., GPT-2)
  - Fine-tune on dataset emphasizing your concept
  - Track when concept-specific circuit strengthens
  - Faster and cheaper than pretraining from scratch
- Option B: Small-scale pretraining
  - Train small model (e.g., 2-layer transformer) from scratch
  - On simplified task involving your concept
  - Complete control over training dynamics
  - Can run multiple seeds for robustness
- Option C: Using public checkpoints
  - Pythia suite provides checkpoints at many training steps
  - No training cost—just analysis
  - Limited to what checkpoints are available
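For the fine-tuning options, one way to save checkpoints is with Hugging Face `Trainer`, sketched below: fixed-interval saving via `save_steps` is built in, and log-spaced saving can be approximated with a small callback. The step set and callback name are our own choices, and the model/dataset setup is omitted.

```python
from transformers import TrainingArguments, TrainerCallback

# Built-in option: save every 500 optimizer steps
args = TrainingArguments(
    output_dir="checkpoints/concept-finetune",
    save_steps=500,
)

# Alternative: save only at a log-spaced set of steps via a callback
LOG_STEPS = {100, 300, 1_000, 3_000, 10_000, 30_000}  # our choice; adjust to your run

class LogSpacedSave(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        # Overrides the default fixed-interval schedule when this callback is used
        control.should_save = state.global_step in LOG_STEPS
        return control

# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   callbacks=[LogSpacedSave()])
# trainer.train()
```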
7. Connecting to Your Research
Research Questions Enabled by Training Dynamics
- Necessity: Is your circuit necessary for the task, or just correlated? (If it forms when the capability emerges, it's likely necessary)
- Sufficiency: Can early checkpoints solve the task if you manually strengthen the circuit? (Test via activation patching)
- Robustness: Does the circuit form consistently across seeds? (Universal vs incidental)
- Simplicity: Do simpler circuits form before more complex alternatives? (Inductive bias)
Integration with Previous Weeks
Week-by-Week Application:
- Week 3 (Visualization): Track logit lens evolution—when does concept prediction sharpen?
- Week 4 (Causal): Plot ablation effects over time—when do components become necessary?
- Week 5 (Probes): Probe accuracy trajectory—when does linear representation emerge?
- Week 7 (SAEs): Feature activation over training—do features sharpen or stay diffuse?
- Week 8 (Circuits): Path patching at different checkpoints—when does composition happen?
8. Case Studies
Case 1: Modular Addition Grokking
Task: Learn (a + b) mod p for prime p
Observations:
- Training accuracy: 100% at epoch 1000
- Validation accuracy: ~0% until epoch 10,000, then jumps to 100%
Mechanistic findings (Nanda et al.):
- Memorization circuit (lookup table) forms by epoch 1000
- Fourier feature circuit (generalizable algorithm) forms gradually during epochs 1000-10,000
- Weight decay slowly shrinks memorization circuit while Fourier circuit grows stronger
- At crossover point (~epoch 9500), Fourier circuit dominates → grokking
Prediction: By tracking circuit strengths with ablation, the authors could anticipate grokking roughly 2000 epochs before it appeared in validation accuracy
Case 2: Induction Head Formation
Task: Language modeling (next-token prediction)
Observations (Olsson et al.):
- In-context learning score near 0 before 2B tokens
- Sharp increase from 2.0B to 2.5B tokens
- Plateau at high ICL performance after 2.5B tokens
Mechanistic findings:
- 1.5B tokens: Previous token heads form
- 2.0B tokens: Induction heads begin forming
- 2.2B tokens: Q-K composition between head types strengthens
- 2.5B tokens: Full induction circuit operational
Key insight: Phase transition is sharp (~500M tokens) but hierarchical assembly takes longer
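A sketch of the kind of induction-score measurement behind these curves, using TransformerLens on a repeated random sequence (GPT-2 here as a stand-in; in a dynamics study you would repeat this per checkpoint). A head's induction score is its average attention from tokens in the second repeat back to the token just after that token's first occurrence.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # substitute your checkpointed model

L = 50
bos = torch.tensor([[model.tokenizer.bos_token_id]])
rand = torch.randint(100, model.cfg.d_vocab, (1, L))
tokens = torch.cat([bos, rand, rand], dim=1)        # positions 1..L repeat at L+1..2L

_, cache = model.run_with_cache(tokens)

scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]            # [head, query_pos, key_pos]
    # For a query in the second repeat, the induction target is key = query - (L - 1):
    # the token right after the query token's first occurrence.
    stripe = pattern.diagonal(offset=-(L - 1), dim1=-2, dim2=-1)   # [head, n_diag]
    scores[layer] = stripe[:, -L:].mean(dim=-1)     # average over second-repeat queries

top = scores.flatten().topk(5)
print(top.values, top.indices)  # flat index = layer * n_heads + head
```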
9. Common Patterns and Principles
Empirical Observations
Patterns across multiple studies:
- Simple before complex: Individual components form before composed circuits
- Multiple solutions: Models often try many approaches before settling on one
- Sudden composition: Components exist separately, then suddenly wire together
- Lottery ticket-like: Some random initializations have "winning" subnetworks that form faster
- Regularization matters: Weight decay, dropout affect which circuits win
10. Limitations and Open Questions
What We Don't Understand
- Why sharp transitions? What causes phase transitions vs smooth learning?
- Prediction: Can we predict emergence timing from early training signals?
- Intervention: Can we speed up or redirect circuit formation?
- Scaling: Do these patterns hold in very large models?
- Generality: Are these findings specific to transformers or universal?
11. Summary
Key Takeaways
- Mechanisms don't exist from initialization—they form during training through specific dynamics
- Formation can be sudden (phase transition) or gradual depending on circuit complexity
- Grokking reveals competing circuits—memorization vs generalization
- Induction heads show hierarchical assembly—components before composition
- Checkpoint analysis lets you apply interpretability methods across training
- Training dynamics reveal why mechanisms exist—what makes them "natural" solutions
For Your Project
- Choose: fine-tuning study, small-scale pretraining, or public checkpoints (Pythia)
- Save checkpoints at appropriate intervals (log-spaced recommended)
- Apply your Week 3-8 methods to each checkpoint
- Look for phase transitions, grokking, or gradual emergence
- Connect emergence timing to your circuit's structure and function
References & Resources
Core Papers
- Nanda et al. (2023): "Progress measures for grokking via mechanistic interpretability." ICLR. arXiv:2301.05217
- Olsson et al. (2022): "In-context Learning and Induction Heads." Transformer Circuits Thread. Link
- Power et al. (2022): "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." ICLR. arXiv:2201.02177
Related Work
- Chughtai et al. (2024): "A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations." arXiv:2402.15057
- Chan et al. (2024): "The Developmental Landscape of In-Context Learning." arXiv:2402.02364
- Elhage et al. (2022): "Toy Models of Superposition." Transformer Circuits Thread. Link
- Wei et al. (2022): "Emergent Abilities of Large Language Models." TMLR. arXiv:2206.07682
- Liu et al. (2023): "Towards Understanding Grokking: An Effective Theory of Representation Learning." NeurIPS. arXiv:2205.10343
Tools and Resources
- Pythia Suite: Models with checkpoints at many training steps GitHub
- TransformerLens: Library for mechanistic interpretability with checkpoint support GitHub
- Grokking reproduction code: Nanda's experiments GitHub
In-Class Exercise: When Do Puns Emerge?
Using NDIF (National Deep Inference Fabric) to access OLMo training checkpoints, we will track
when pun understanding emerges during pretraining. This connects our pun thread to training dynamics.
Part 1: Setup and Checkpoint Selection (15 min)
Access OLMo checkpoints through NDIF:
- Connect to NDIF: Use the provided notebook to access the OLMo checkpoint suite
- Select checkpoints: Choose 8-10 checkpoints spanning training:
  - Early: steps 1k, 5k, 10k (before most capabilities emerge)
  - Middle: steps 50k, 100k, 200k (capability formation)
  - Late: steps 500k, 1M, final (mature model)
- Prepare pun test set: Load your puns from Week 3 (10-15 examples)
Part 2: Track Pun Capabilities Over Training (25 min)
Measure pun understanding at each checkpoint:
- Pun completion accuracy (see the sketch after this list):
  - For each checkpoint, test: can it complete pun setups correctly?
  - "I used to be a banker, but I lost ___" → does it predict "interest"?
  - Record top-5 predictions and their probabilities
- Probe accuracy over training:
  - Apply your Week 6 pun probe to each checkpoint
  - Does pun/non-pun separation improve over training?
  - At which checkpoint does probe accuracy exceed 70%? 90%?
- Logit lens at each checkpoint:
  - For a fixed pun, run logit lens at early, middle, and late checkpoints
  - Does the "correct" punchline emerge at earlier layers as training progresses?
Part 3: Identify Phase Transitions (20 min)
Look for sudden changes in pun capability:
- Plot capability curves (see the sketch after this list):
  - X-axis: training step (log scale)
  - Y-axis: pun completion accuracy, probe accuracy, or other metric
- Identify transitions:
  - Is pun understanding gradual or sudden?
  - If sudden, at what training step does it emerge?
  - Does it correlate with other capability emergence (e.g., general language ability)?
- Compare to in-context learning:
  - Test pun completion with and without pun examples in context
  - At early checkpoints, does in-context help more than at late checkpoints?
  - Does ICL for puns emerge at the same time as induction heads?
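A sketch of the capability-curve plot; the dictionaries of metric values here are invented placeholders standing in for the numbers you collect in Parts 1-2.

```python
import matplotlib.pyplot as plt

# Invented placeholder values: {training_step: metric}; use your Part 1-2 measurements
completion_acc = {1_000: 0.00, 10_000: 0.05, 50_000: 0.10, 200_000: 0.55, 500_000: 0.60}
probe_acc      = {1_000: 0.52, 10_000: 0.61, 50_000: 0.78, 200_000: 0.90, 500_000: 0.92}

fig, ax = plt.subplots()
for label, curve in [("pun completion", completion_acc), ("probe accuracy", probe_acc)]:
    steps = sorted(curve)
    ax.plot(steps, [curve[s] for s in steps], marker="o", label=label)
ax.set_xscale("log")                 # log-scale training steps, as suggested above
ax.set_xlabel("training step")
ax.set_ylabel("accuracy")
ax.legend()
plt.show()
```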
Discussion: Is pun understanding a distinct capability that emerges at a specific point,
or does it improve gradually with general language ability? What does this tell us about how models
learn to process humor?
Open Pun Emergence Notebook in Colab
Note: Requires NDIF access for loading OLMo checkpoints.
Project Milestone
Due: Thursday of Week 9
Study when and how your concept's circuit emerges during training. Apply your interpretability methods
to multiple training checkpoints to track mechanism formation.
Training Dynamics Study
- Choose your approach:
  - Option A (Recommended): Use Pythia checkpoints—apply your methods to pre-saved checkpoints at different training steps
  - Option B: Fine-tune your selected model on concept-rich data, saving checkpoints every N steps
  - Option C: Train a small model from scratch on a simplified task involving your concept
- Save/select checkpoints:
  - Choose 8-15 checkpoints spanning early to late training
  - Log-spaced recommended (e.g., steps 100, 300, 1k, 3k, 10k, 30k, 100k)
  - Ensure you capture pre-emergence and post-emergence states
- Apply interpretability methods to each checkpoint:
  - From Week 5: Probe accuracy over training—when does concept become linearly accessible?
  - From Week 4: Ablation effects over training—when do components become necessary?
  - From Week 3: Logit lens evolution—when does prediction sharpen?
  - From Week 8: Circuit strength (path patching)—when does composition emerge?
- Analyze emergence pattern:
  - Is emergence gradual or sudden (phase transition)?
  - Do multiple metrics change simultaneously?
  - What is the timeline of component formation?
  - Is there hierarchical assembly (some parts before others)?
Deliverables:
- Training trajectory plots:
  - Probe accuracy vs training step
  - Ablation effect size vs training step
  - Key metrics showing mechanism emergence
  - Annotate identified phase transitions or inflection points
- Emergence analysis:
  - When does your concept's circuit form? (specific training step/token count)
  - Is formation sudden or gradual?
  - Do components form in sequence or simultaneously?
  - Comparison to loss curves—does emergence coincide with capability gain?
- Mechanistic interpretation:
  - Why does the circuit form when it does?
  - Is there evidence of competing mechanisms (grokking-style)?
  - What does emergence timing reveal about your concept's role?
- Code: Notebook with checkpoint analysis and visualization
Training dynamics reveal whether your circuit is a "natural" solution that forms reliably, or an
accidental feature specific to your model's initialization and training procedure.