Learning Objectives
By the end of this week, you should be able to:
- Understand how neural network mechanisms emerge during training
- Study phase transitions and grokking phenomena
- Track circuit formation across training checkpoints
- Apply interpretability methods to training trajectories
- Identify when your concept's representation emerges
- Understand why some mechanisms form suddenly vs gradually
- Connect training dynamics to final model behavior
Required Readings
Supplementary Readings
1. Why Study Training Dynamics?
The Formation Question
In Weeks 1-8 you studied your concept in a fully trained model. But critical questions remain:
- When does your concept's circuit form during training?
- Does it emerge suddenly (phase transition) or gradually?
- Which components appear first? Which come later?
- What training dynamics (loss curves, gradient patterns) coincide with emergence?
- Can we predict when a capability will emerge from training dynamics?
Why This Matters for Research:
Understanding how mechanisms form helps explain why they exist. Training dynamics reveal:
- What inductive biases lead models to learn your concept
- Whether your concept is a "natural" solution or coincidental
- How robust the mechanism is (sudden vs gradual = different stability)
- What minimal training is needed for the concept to appear
Use Cases for Your Project
Example: Induction Heads
Olsson et al. (2022) found induction heads emerge in a sharp phase transition around 2B tokens. Before: random attention. After: systematic pattern copying. This sudden emergence suggests induction is a "natural" circuit that SGD discovers reliably.
2. Grokking: Delayed Generalization
The Grokking Phenomenon
Grokking (Power et al., 2022) is when a model:
- Overfits training data quickly (100% train accuracy)
- Remains at chance on validation for many epochs
- Suddenly achieves perfect generalization after extended training
Key Observation: During the long plateau before grokking, the model has already memorized the training data but is still slowly learning to generalize. Something internal changes even though training loss is flat.
Mechanistic Explanation
Nanda et al. (2023) "Progress measures for grokking via mechanistic interpretability" showed:
- Two competing circuits: A "memorization circuit" (fast to learn, doesn't generalize) and a "generalizing circuit" (slow to learn, works on all data)
- Circuit competition: Initially, memorization circuit dominates. Over time, weight decay and other regularization favor the generalizing circuit
- Phase transition: When generalizing circuit becomes stronger, validation accuracy suddenly jumps
Experiment: Modular Addition
Task: Predict (a + b) mod 113
Memorization circuit: Lookup table in early layers
Generalizing circuit: Fourier features that exploit group structure
By tracking the strength of each circuit (via ablation) across checkpoints, Nanda et al. could anticipate grokking before it showed up in validation metrics.
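One concrete progress measure of this kind, sketched below under strong assumptions: given the learned embedding rows for the number tokens 0..112 (a placeholder tensor `W_E`), project them onto the discrete Fourier basis and measure how concentrated the variance is on a few frequencies. A memorizing model spreads energy broadly; a model converging on the Fourier algorithm concentrates it, so this number should rise ahead of the validation jump. This is an illustrative proxy, not Nanda et al.'s exact metric.

```python
import torch

def fourier_concentration(W_E: torch.Tensor, p: int = 113, top_k: int = 6) -> float:
    """Fraction of embedding variance captured by the top-k Fourier frequencies.

    W_E: [p, d_model] embedding rows for the number tokens 0..p-1 (placeholder name).
    A memorizing model spreads energy across many frequencies; a model that has
    learned the Fourier ("clock") algorithm concentrates it on a handful.
    """
    x = torch.arange(p).float()
    basis = [torch.ones(p)]                                # constant component
    for k in range(1, p // 2 + 1):                         # frequencies 1..(p-1)/2
        basis.append(torch.cos(2 * torch.pi * k * x / p))
        basis.append(torch.sin(2 * torch.pi * k * x / p))
    F = torch.stack(basis)
    F = F / F.norm(dim=-1, keepdim=True)                   # orthonormal rows
    power = (F @ W_E).pow(2).sum(dim=-1)                   # energy per basis vector
    freq_power = power[1:].reshape(-1, 2).sum(dim=-1)      # combine cos/sin pairs
    return (freq_power.topk(top_k).values.sum() / freq_power.sum()).item()

# Hypothetical usage across checkpoints (the loader is project-specific):
# for step, model in iter_checkpoints():
#     print(step, fourier_concentration(model.embed.W_E[:113]))
```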
3. Phase Transitions in Learning
Sudden vs Gradual Emergence
Not all mechanisms form gradually. Some appear in sharp phase transitions:
| Mechanism Type | Emergence Pattern | Example |
| --- | --- | --- |
| Simple features | Gradual | Edge detectors in CNNs |
| Composed circuits | Phase transition | Induction heads (Olsson et al.) |
| Algorithm discovery | Phase transition (grokking) | Modular arithmetic (Nanda et al.) |
The Induction Head Phase Transition
Case Study: Olsson et al. (2022) "In-context Learning and Induction Heads"
Finding: Induction heads (the circuit enabling in-context learning) form suddenly during training
Timeline:
- Before 2B tokens: No induction behavior, random attention patterns
- Around 2B tokens: Sharp increase in induction score over ~500M tokens
- After 2.5B tokens: Fully formed induction heads, strong ICL performance
What emerges:
- Previous token heads: Attend to previous occurrence of current token
- Induction heads: Attend to token after the previous occurrence
- Composition: The previous token heads' output feeds the induction heads' QK circuit (K-composition), wiring the two head types together
Key Insight: ICL ability improves in lockstep with induction head formation, suggesting induction heads are the mechanism for ICL, at least in small models.
4. Methods: Tracking Mechanism Formation
Checkpoint-Based Analysis
To study training dynamics, you need to apply your interpretability methods to models at different training stages:
- Save checkpoints: Every N steps (e.g., every 1000 steps, or logarithmically spaced)
- Apply methods: Run your Week 3-8 analyses on each checkpoint (a loading-loop sketch follows this list):
  - Probing accuracy over time
  - Circuit component strengths (ablation effects)
  - Attention pattern analysis
  - SAE feature activation patterns
- Plot trajectories: How do these metrics evolve?
- Identify transitions: When do sudden changes occur?
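A minimal sketch of this checkpoint loop, using the Pythia suite, whose checkpoints are published as Hugging Face repo revisions named `step<N>`; `concept_metric` is a placeholder for whichever Week 3-8 analysis you are tracking:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-160m"
STEPS = [1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000]  # roughly log-spaced

tokenizer = AutoTokenizer.from_pretrained(MODEL)

def concept_metric(model) -> float:
    """Placeholder: swap in your probe accuracy, ablation effect, attention score, etc."""
    ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids
    with torch.no_grad():
        probs = model(ids).logits[0, -1].softmax(-1)
    return probs[tokenizer(" mat").input_ids[0]].item()  # toy metric: P(" mat")

trajectory = {}
for step in STEPS:
    # Each Pythia checkpoint is stored as a repo revision named "step<N>"
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=f"step{step}")
    model.eval()
    trajectory[step] = concept_metric(model)
    del model  # release memory before loading the next checkpoint

print(trajectory)
```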
Key Metrics to Track
| Metric | What It Measures | Interpretation |
| --- | --- | --- |
| Probe accuracy | Linear readability of concept | When does concept become linearly accessible? |
| Ablation effect | Causal importance of component | When does component become necessary? |
| Attention patterns | Information routing | When do specific attention heads form? |
| Loss on concept-specific data | Task-specific performance | When does model learn the concept? |
| Logit lens evolution | Prediction development across layers | How does processing change over training? |
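For the probe-accuracy metric in particular, the per-checkpoint computation might look like the sketch below. It assumes a project-specific helper `get_activations_and_labels` (not defined here) that returns residual-stream activations and binary concept labels for your dataset at a given layer:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy_at_checkpoint(model, layer: int) -> float:
    """Train a linear probe on concept labels; return held-out accuracy."""
    # X: [n_examples, d_model] activations at `layer`; y: [n_examples] 0/1 labels.
    # get_activations_and_labels is an assumed project-specific helper.
    X, y = get_activations_and_labels(model, layer)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# accuracies = {step: probe_accuracy_at_checkpoint(load_checkpoint(step), layer=6)
#               for step in STEPS}
```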
Detecting Phase Transitions
How to identify a phase transition:
- Sharp gradient: Metric changes rapidly over short training period
- Temporal correlation: Multiple related metrics change simultaneously
- Bifurcation point: Before = one behavior, after = qualitatively different
- Reproducibility: Transition occurs at similar point across random seeds
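One way to operationalize the "sharp gradient" criterion above, as a rough sketch: take the discrete derivative of a tracked metric with respect to log training step and flag the interval where it peaks (the jump threshold is arbitrary and should be tuned to your metric).

```python
import numpy as np

def find_transition(steps, metric, min_jump=0.2):
    """Return the step with the steepest change per unit log-step, or None."""
    steps, metric = np.asarray(steps, float), np.asarray(metric, float)
    slopes = np.diff(metric) / np.diff(np.log(steps))   # change per log-step interval
    i = int(np.argmax(np.abs(slopes)))
    if abs(metric[i + 1] - metric[i]) < min_jump:        # too small to call a transition
        return None
    return int(steps[i + 1])

# Toy example: probe accuracy flat near chance, then jumping between 8k and 16k steps
steps = [1000, 2000, 4000, 8000, 16000, 32000]
acc   = [0.52, 0.55, 0.54, 0.58, 0.93, 0.95]
print(find_transition(steps, acc))  # -> 16000
```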
5. Circuit Formation Patterns
Hierarchical Assembly
Complex circuits often form in stages:
- Foundation components form first (e.g., attention heads that move information)
- Processing components form next (e.g., MLPs that transform representations)
- Composition emerges last (components wire together into full circuit)
Example: Induction Circuit Assembly
- Step 1 (~1.5B tokens): Previous token heads form
- Step 2 (~2B tokens): Induction heads form
- Step 3 (~2.2B tokens): Q-K composition circuit connects them
Each stage builds on the previous. Without previous token heads, induction heads can't function.
Redundancy and Pruning
Training dynamics often show:
- Overproduction: Many components initially contribute to a task
- Consolidation: Over training, contribution concentrates in fewer components
- Specialization: Components become more selective/focused
6. Practical Considerations
Computational Costs
Challenge: Full training runs are expensive. Strategies to reduce cost:
- Use small models: GPT-2 Small, Pythia-160M for initial studies
- Fine-tune instead of pretraining: Study how concept emerges during task-specific fine-tuning
- Sparse checkpoints: Log-spaced checkpoints capture transitions with fewer saves (a small helper sketch follows this list)
- Targeted metrics: Don't run all analyses on all checkpoints—focus on key transitions
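The log-spaced schedule mentioned above can be generated with a small helper along these lines (endpoints and count are up to you):

```python
import numpy as np

def log_spaced_steps(first: int, last: int, n: int) -> list[int]:
    """Roughly n log-spaced, deduplicated checkpoint steps between first and last."""
    return np.unique(np.round(np.geomspace(first, last, n)).astype(int)).tolist()

print(log_spaced_steps(100, 100_000, 10))
# -> [100, 215, 464, 1000, 2154, 4642, 10000, 21544, 46416, 100000]
```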
Experimental Design
For your project:
- Option A: Fine-tuning study (a checkpoint-saving sketch follows this list)
  - Start with pretrained model (e.g., GPT-2)
  - Fine-tune on dataset emphasizing your concept
  - Track when concept-specific circuit strengthens
  - Faster and cheaper than pretraining from scratch
- Option B: Small-scale pretraining
  - Train small model (e.g., 2-layer transformer) from scratch
  - On simplified task involving your concept
  - Complete control over training dynamics
  - Can run multiple seeds for robustness
- Option C: Using public checkpoints
  - Pythia suite provides checkpoints at many training steps
  - No training cost—just analysis
  - Limited to what checkpoints are available
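For the fine-tuning options, one way to save checkpoints is with Hugging Face `Trainer`, sketched below: fixed-interval saving via `save_steps` is built in, and log-spaced saving can be approximated with a small callback. The step set and callback name are our own choices, and the model/dataset setup is omitted.

```python
from transformers import TrainingArguments, TrainerCallback

# Built-in option: save every 500 optimizer steps
args = TrainingArguments(
    output_dir="checkpoints/concept-finetune",
    save_steps=500,
)

# Alternative: save only at a log-spaced set of steps via a callback
LOG_STEPS = {100, 300, 1_000, 3_000, 10_000, 30_000}  # our choice; adjust to your run

class LogSpacedSave(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        # Overrides the default fixed-interval schedule when this callback is used
        control.should_save = state.global_step in LOG_STEPS
        return control

# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   callbacks=[LogSpacedSave()])
# trainer.train()
```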
7. Connecting to Your Research
Research Questions Enabled by Training Dynamics
- Necessity: Is your circuit necessary for the task, or just correlated? (If it forms when the capability emerges, it's likely necessary)
- Sufficiency: Can early checkpoints solve the task if you manually strengthen the circuit? (Test via activation patching)
- Robustness: Does the circuit form consistently across seeds? (Universal vs incidental)
- Simplicity: Do simpler circuits form before more complex alternatives? (Inductive bias)
Integration with Previous Weeks
Week-by-Week Application:
- Week 3 (Visualization): Track logit lens evolution—when does concept prediction sharpen?
- Week 4 (Causal): Plot ablation effects over time—when do components become necessary?
- Week 5 (Probes): Probe accuracy trajectory—when does linear representation emerge?
- Week 7 (SAEs): Feature activation over training—do features sharpen or stay diffuse?
- Week 8 (Circuits): Path patching at different checkpoints—when does composition happen?
8. Case Studies
Case 1: Modular Addition Grokking
Task: Learn (a + b) mod p for prime p
Observations:
- Training accuracy: 100% at epoch 1000
- Validation accuracy: ~0% until epoch 10,000, then jumps to 100%
Mechanistic findings (Nanda et al.):
- Memorization circuit (lookup table) forms by epoch 1000
- Fourier feature circuit (generalizable algorithm) forms gradually during epochs 1000-10,000
- Weight decay slowly shrinks memorization circuit while Fourier circuit grows stronger
- At crossover point (~epoch 9500), Fourier circuit dominates → grokking
Prediction: By tracking circuit strengths with ablation, the authors could anticipate grokking roughly 2000 epochs before it appeared in validation accuracy
Case 2: Induction Head Formation
Task: Language modeling (next-token prediction)
Observations (Olsson et al.):
- In-context learning score near 0 before 2B tokens
- Sharp increase from 2.0B to 2.5B tokens
- Plateau at high ICL performance after 2.5B tokens
Mechanistic findings:
- 1.5B tokens: Previous token heads form
- 2.0B tokens: Induction heads begin forming
- 2.2B tokens: Q-K composition between head types strengthens
- 2.5B tokens: Full induction circuit operational
Key insight: Phase transition is sharp (~500M tokens) but hierarchical assembly takes longer
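A sketch of the kind of induction-score measurement behind these curves, using TransformerLens on a repeated random sequence (GPT-2 here as a stand-in; in a dynamics study you would repeat this per checkpoint). A head's induction score is its average attention from tokens in the second repeat back to the token just after that token's first occurrence.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # substitute your checkpointed model

L = 50
bos = torch.tensor([[model.tokenizer.bos_token_id]])
rand = torch.randint(100, model.cfg.d_vocab, (1, L))
tokens = torch.cat([bos, rand, rand], dim=1)        # positions 1..L repeat at L+1..2L

_, cache = model.run_with_cache(tokens)

scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]            # [head, query_pos, key_pos]
    # For a query in the second repeat, the induction target is key = query - (L - 1):
    # the token right after the query token's first occurrence.
    stripe = pattern.diagonal(offset=-(L - 1), dim1=-2, dim2=-1)   # [head, n_diag]
    scores[layer] = stripe[:, -L:].mean(dim=-1)     # average over second-repeat queries

top = scores.flatten().topk(5)
print(top.values, top.indices)  # flat index = layer * n_heads + head
```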
9. Common Patterns and Principles
Empirical Observations
Patterns across multiple studies:
- Simple before complex: Individual components form before composed circuits
- Multiple solutions: Models often try many approaches before settling on one
- Sudden composition: Components exist separately, then suddenly wire together
- Lottery ticket-like: Some random initializations have "winning" subnetworks that form faster
- Regularization matters: Weight decay, dropout affect which circuits win
10. Limitations and Open Questions
What We Don't Understand
- Why sharp transitions? What causes phase transitions vs smooth learning?
- Prediction: Can we predict emergence timing from early training signals?
- Intervention: Can we speed up or redirect circuit formation?
- Scaling: Do these patterns hold in very large models?
- Generality: Are these findings specific to transformers or universal?
11. Summary
Key Takeaways
- Mechanisms don't exist from initialization—they form during training through specific dynamics
- Formation can be sudden (phase transition) or gradual depending on circuit complexity
- Grokking reveals competing circuits—memorization vs generalization
- Induction heads show hierarchical assembly—components before composition
- Checkpoint analysis lets you apply interpretability methods across training
- Training dynamics reveal why mechanisms exist—what makes them "natural" solutions
For Your Project
- Choose: fine-tuning study, small-scale pretraining, or public checkpoints (Pythia)
- Save checkpoints at appropriate intervals (log-spaced recommended)
- Apply your Week 3-8 methods to each checkpoint
- Look for phase transitions, grokking, or gradual emergence
- Connect emergence timing to your circuit's structure and function
References & Resources
Core Papers
- Nanda et al. (2023): "Progress measures for grokking via mechanistic interpretability." ICLR. arXiv:2301.05217
- Olsson et al. (2022): "In-context Learning and Induction Heads." Transformer Circuits Thread. Link
- Power et al. (2022): "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." ICLR. arXiv:2201.02177
Related Work
- Chughtai et al. (2024): "A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations." arXiv:2402.15057
- Chan et al. (2024): "The Developmental Landscape of In-Context Learning." arXiv:2402.02364
- Elhage et al. (2022): "Toy Models of Superposition." Transformer Circuits Thread. Link
- Wei et al. (2022): "Emergent Abilities of Large Language Models." TMLR. arXiv:2206.07682
- Liu et al. (2023): "Towards Understanding Grokking: An Effective Theory of Representation Learning." NeurIPS. arXiv:2205.10343
Tools and Resources
- Pythia Suite: Models with checkpoints at many training steps GitHub
- TransformerLens: Library for mechanistic interpretability with checkpoint support GitHub
- Grokking reproduction code: Nanda's experiments GitHub
In-Class Exercise: When Do Puns Emerge?
Using NDIF (National Deep Inference Fabric) to access OLMo training checkpoints, we will track
when pun understanding emerges during pretraining. This connects our pun thread to training dynamics.
Part 1: Setup and Checkpoint Selection (15 min)
Access OLMo checkpoints through NDIF:
- Connect to NDIF: Use the provided notebook to access the OLMo checkpoint suite
- Select checkpoints: Choose 8-10 checkpoints spanning training:
  - Early: steps 1k, 5k, 10k (before most capabilities emerge)
  - Middle: steps 50k, 100k, 200k (capability formation)
  - Late: steps 500k, 1M, final (mature model)
- Prepare pun test set: Load your puns from Week 3 (10-15 examples)
Part 2: Track Pun Capabilities Over Training (25 min)
Measure pun understanding at each checkpoint:
- Pun completion accuracy (see the sketch after this list):
  - For each checkpoint, test: can it complete pun setups correctly?
  - "I used to be a banker, but I lost ___" → does it predict "interest"?
  - Record top-5 predictions and their probabilities
- Probe accuracy over training:
  - Apply your Week 6 pun probe to each checkpoint
  - Does pun/non-pun separation improve over training?
  - At which checkpoint does probe accuracy exceed 70%? 90%?
- Logit lens at each checkpoint:
  - For a fixed pun, run logit lens at early, middle, and late checkpoints
  - Does the "correct" punchline emerge at earlier layers as training progresses?
Part 3: Identify Phase Transitions (20 min)
Look for sudden changes in pun capability:
- Plot capability curves (see the sketch after this list):
  - X-axis: training step (log scale)
  - Y-axis: pun completion accuracy, probe accuracy, or other metric
- Identify transitions:
  - Is pun understanding gradual or sudden?
  - If sudden, at what training step does it emerge?
  - Does it correlate with other capability emergence (e.g., general language ability)?
- Compare to in-context learning:
  - Test pun completion with and without pun examples in context
  - At early checkpoints, does in-context help more than at late checkpoints?
  - Does ICL for puns emerge at the same time as induction heads?
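A sketch of the capability-curve plot; the dictionaries of metric values here are invented placeholders standing in for the numbers you collect in Parts 1-2.

```python
import matplotlib.pyplot as plt

# Invented placeholder values: {training_step: metric}; use your Part 1-2 measurements
completion_acc = {1_000: 0.00, 10_000: 0.05, 50_000: 0.10, 200_000: 0.55, 500_000: 0.60}
probe_acc      = {1_000: 0.52, 10_000: 0.61, 50_000: 0.78, 200_000: 0.90, 500_000: 0.92}

fig, ax = plt.subplots()
for label, curve in [("pun completion", completion_acc), ("probe accuracy", probe_acc)]:
    steps = sorted(curve)
    ax.plot(steps, [curve[s] for s in steps], marker="o", label=label)
ax.set_xscale("log")                 # log-scale training steps, as suggested above
ax.set_xlabel("training step")
ax.set_ylabel("accuracy")
ax.legend()
plt.show()
```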
Discussion: Is pun understanding a distinct capability that emerges at a specific point,
or does it improve gradually with general language ability? What does this tell us about how models
learn to process humor?
Open Pun Emergence Notebook in Colab
Note: Requires NDIF access for loading OLMo checkpoints.
Project Milestone
Due: Thursday of Week 9
Study when and how your concept's circuit emerges during training. Apply your interpretability methods
to multiple training checkpoints to track mechanism formation.
Training Dynamics Study
- Choose your approach:
  - Option A (Recommended): Use Pythia checkpoints—apply your methods to pre-saved checkpoints at different training steps
  - Option B: Fine-tune your selected model on concept-rich data, saving checkpoints every N steps
  - Option C: Train a small model from scratch on a simplified task involving your concept
- Save/select checkpoints:
  - Choose 8-15 checkpoints spanning early to late training
  - Log-spaced recommended (e.g., steps 100, 300, 1k, 3k, 10k, 30k, 100k)
  - Ensure you capture pre-emergence and post-emergence states
- Apply interpretability methods to each checkpoint:
  - From Week 5: Probe accuracy over training—when does concept become linearly accessible?
  - From Week 4: Ablation effects over training—when do components become necessary?
  - From Week 3: Logit lens evolution—when does prediction sharpen?
  - From Week 8: Circuit strength (path patching)—when does composition emerge?
- Analyze emergence pattern:
  - Is emergence gradual or sudden (phase transition)?
  - Do multiple metrics change simultaneously?
  - What is the timeline of component formation?
  - Is there hierarchical assembly (some parts before others)?
Deliverables:
- Training trajectory plots:
  - Probe accuracy vs training step
  - Ablation effect size vs training step
  - Key metrics showing mechanism emergence
  - Annotate identified phase transitions or inflection points
- Emergence analysis:
  - When does your concept's circuit form? (specific training step/token count)
  - Is formation sudden or gradual?
  - Do components form in sequence or simultaneously?
  - Comparison to loss curves—does emergence coincide with capability gain?
- Mechanistic interpretation:
  - Why does the circuit form when it does?
  - Is there evidence of competing mechanisms (grokking-style)?
  - What does emergence timing reveal about your concept's role?
- Code: Notebook with checkpoint analysis and visualization
Training dynamics reveal whether your circuit is a "natural" solution that forms reliably, or an
accidental feature specific to your model's initialization and training procedure.