
Week 9: Training Dynamics & Model Editing

Learning Objectives

By the end of this week, you should be able to:

Required Readings

Supplementary Readings

1. Why Study Training Dynamics?

The Formation Question

In Weeks 1-8, you studied your concept in a fully trained model. But critical questions remain:

Why This Matters for Research:

Understanding how mechanisms form helps explain why they exist. Training dynamics reveal:

Use Cases for Your Project

Example: Induction Heads

Olsson et al. (2022) found induction heads emerge in a sharp phase transition around 2B tokens. Before: random attention. After: systematic pattern copying. This sudden emergence suggests induction is a "natural" circuit that SGD discovers reliably.

2. Grokking: Delayed Generalization

The Grokking Phenomenon

Grokking (Power et al., 2022) occurs when a model:

  1. Overfits training data quickly (100% train accuracy)
  2. Remains at chance on validation for many epochs
  3. Suddenly achieves perfect generalization after extended training

Key Observation: During the "grokking" phase, the model has already memorized the training data but is still learning to generalize. Something internal changes even though training loss is flat.
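A minimal sketch of the classic setup, assuming PyTorch. The architecture, train fraction, and weight decay below are illustrative choices; whether and when grokking occurs at these exact settings varies by seed.

```python
# Minimal grokking sketch: a small MLP on (a + b) mod p, trained full-batch
# with heavy weight decay on a small fraction of all pairs. Train accuracy
# saturates quickly; validation accuracy may stay near chance for thousands
# of epochs before jumping (settings are illustrative, not guaranteed).
import torch
import torch.nn as nn

p = 113
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))  # all (a, b)
labels = (pairs[:, 0] + pairs[:, 1]) % p

perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))  # small train fraction encourages memorization
train_idx, val_idx = perm[:n_train], perm[n_train:]

def one_hot(ab):
    # Concatenate one-hot codes for a and b into a single input of size 2p.
    return torch.cat([nn.functional.one_hot(ab[:, 0], p),
                      nn.functional.one_hot(ab[:, 1], p)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

X_train, y_train = one_hot(pairs[train_idx]), labels[train_idx]
X_val, y_val = one_hot(pairs[val_idx]), labels[val_idx]

for epoch in range(20001):  # grokking typically needs many thousands of epochs
    opt.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    opt.step()
    if epoch % 500 == 0:
        with torch.no_grad():
            train_acc = (model(X_train).argmax(-1) == y_train).float().mean().item()
            val_acc = (model(X_val).argmax(-1) == y_val).float().mean().item()
        print(f"epoch {epoch}: train_acc={train_acc:.3f} val_acc={val_acc:.3f}")
```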

Mechanistic Explanation

Nanda et al. (2023), "Progress measures for grokking via mechanistic interpretability," traced the internal changes behind this transition:

Experiment: Modular Addition

Task: Predict (a + b) mod 113

Memorization circuit: Lookup table in early layers
Generalizing circuit: Fourier features that exploit group structure

By tracking the strength of each circuit (via ablation) across checkpoints, Nanda et al. could predict exactly when grokking would occur—before it showed up in validation metrics!
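The generalizing circuit's logits behave, per Nanda et al., roughly like a sum of cosines that peaks at the correct residue. A quick numerical sanity check that this rule alone solves the task (the frequencies below are arbitrary stand-ins for the model's learned "key" frequencies):

```python
# Check the Fourier algorithm abstractly: logits(c) = sum_k cos(2*pi*k*(a+b-c)/p)
# are maximized exactly at c = (a + b) mod p, for any frequencies k not
# divisible by p. The trained model learns its own small set of frequencies.
import numpy as np

p = 113
freqs = [17, 25, 32]  # illustrative frequencies

def fourier_logits(a, b):
    c = np.arange(p)
    return sum(np.cos(2 * np.pi * k * (a + b - c) / p) for k in freqs)

mismatches = sum(
    fourier_logits(a, b).argmax() != (a + b) % p
    for a in range(p) for b in range(p)
)
print("mismatches:", mismatches)  # expect 0: the identity solves the task
```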

3. Phase Transitions in Learning

Sudden vs Gradual Emergence

Not all mechanisms form gradually. Some appear in sharp phase transitions:

| Mechanism type | Emergence pattern | Example |
| --- | --- | --- |
| Simple features | Gradual | Edge detectors in CNNs |
| Composed circuits | Phase transition | Induction heads (Olsson et al.) |
| Algorithm discovery | Phase transition (grokking) | Modular arithmetic (Nanda et al.) |

The Induction Head Phase Transition

Case Study: Olsson et al. (2022) "In-context Learning and Induction Heads"

Finding: Induction heads (the circuit enabling in-context learning) form suddenly during training

Timeline:

What emerges:

  1. Previous token heads: Attend to the immediately preceding token, making "what came just before" available at each position
  2. Induction heads: Attend to the token after the previous occurrence of the current token
  3. Composition: The induction head's keys read the previous-token head's output (K-composition), wiring the two into a single circuit

Key Insight: ICL ability closely tracks induction head formation, suggesting induction heads are the mechanism for ICL, at least in small models.

4. Methods: Tracking Mechanism Formation

Checkpoint-Based Analysis

To study training dynamics, apply your interpretability methods to models at different training stages (a minimal checkpoint loop is sketched after this list):

  1. Save checkpoints: Every N steps (e.g., every 1000 steps, or logarithmically spaced)
  2. Apply methods: Run your Week 3-8 analyses on each checkpoint
    • Probing accuracy over time
    • Circuit component strengths (ablation effects)
    • Attention pattern analysis
    • SAE feature activation patterns
  3. Plot trajectories: How do these metrics evolve?
  4. Identify transitions: When do sudden changes occur?
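A minimal sketch of this loop using the Pythia suite, whose checkpoints are published as HuggingFace branches named `step<N>`. Here `compute_metric` is a placeholder for whichever Week 3-8 analysis you run (probing, ablation, attention analysis, ...):

```python
# Sketch: apply a fixed analysis to a series of training checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-160m"
steps = [1000, 3000, 10000, 33000, 143000]  # roughly log-spaced

tokenizer = AutoTokenizer.from_pretrained(MODEL)

def compute_metric(model):
    # Placeholder metric: loss on a tiny concept-specific eval set.
    texts = ["I used to be a banker, but I lost interest."]  # your eval data
    enc = tokenizer(texts, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

trajectory = {}
for step in steps:
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=f"step{step}")
    model.eval()
    trajectory[step] = compute_metric(model)
    del model  # free memory before loading the next checkpoint

print(trajectory)  # plot metric vs step (log x-axis) to spot transitions
```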

Key Metrics to Track

| Metric | What it measures | Interpretation |
| --- | --- | --- |
| Probe accuracy | Linear readability of concept | When does the concept become linearly accessible? |
| Ablation effect | Causal importance of component | When does the component become necessary? |
| Attention patterns | Information routing | When do specific attention heads form? |
| Loss on concept-specific data | Task-specific performance | When does the model learn the concept? |
| Logit lens evolution | Prediction development across layers | How does processing change over training? |
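For example, the first row (probe accuracy over training) can be tracked with a sketch like the following, assuming a small labeled concept dataset; the texts, labels, and layer index are toy placeholders to swap for your own:

```python
# Sketch: fit a linear probe on one layer's last-token activations at each
# checkpoint and record its held-out accuracy over training.
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER = "EleutherAI/pythia-160m", 6
texts = [  # toy placeholder data: replace with your labeled concept set
    "I used to be a banker, but I lost interest.",
    "Time flies like an arrow; fruit flies like a banana.",
    "The weather today is cold and rainy.",
    "She bought groceries at the store.",
] * 5
labels = [1, 1, 0, 0] * 5  # 1 = concept present, 0 = absent

tokenizer = AutoTokenizer.from_pretrained(MODEL)

def last_token_reps(model):
    reps = []
    for t in texts:
        enc = tokenizer(t, return_tensors="pt")
        with torch.no_grad():
            hs = model(**enc, output_hidden_states=True).hidden_states
        reps.append(hs[LAYER][0, -1].numpy())  # last-token activation
    return reps

for step in [1000, 10000, 143000]:
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=f"step{step}")
    X_tr, X_te, y_tr, y_te = train_test_split(
        last_token_reps(model), labels, test_size=0.3, stratify=labels)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"step {step}: probe accuracy = {probe.score(X_te, y_te):.3f}")
```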

Detecting Phase Transitions

How to identify a phase transition:

  1. Plot each metric against training step (log-scale x-axis)
  2. Look for a jump that is large relative to the metric's overall range and concentrated between a few adjacent checkpoints
  3. Check whether independent metrics (probe accuracy, ablation effects, task loss) jump at the same point
  4. Where possible, confirm the transition replicates across random seeds
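A minimal numeric version of steps 1-3; the 50%-of-range threshold for "sharp" is an arbitrary choice to sanity-check against the plotted curve:

```python
# Sketch: flag a candidate phase transition as the largest jump in a metric
# between consecutive (log-spaced) checkpoints.
import numpy as np

steps = np.array([100, 300, 1000, 3000, 10000, 30000, 100000])
metric = np.array([0.01, 0.02, 0.02, 0.05, 0.81, 0.90, 0.92])  # e.g. probe acc

jumps = np.diff(metric)
i = jumps.argmax()
if jumps[i] > 0.5 * (metric.max() - metric.min()):  # "sharp" = >50% of range
    print(f"candidate transition between steps {steps[i]} and {steps[i + 1]}")
else:
    print("no sharp jump found; emergence looks gradual")
```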

5. Circuit Formation Patterns

Hierarchical Assembly

Complex circuits often form in stages:

  1. Foundation components form first (e.g., attention heads that move information)
  2. Processing components form next (e.g., MLPs that transform representations)
  3. Composition emerges last (components wire together into full circuit)

Example: Induction Circuit Assembly

  1. Step 1 (~1.5B tokens): Previous token heads form
  2. Step 2 (~2B tokens): Induction heads form
  3. Step 3 (~2.2B tokens): K-composition connects them into the full circuit

Each stage builds on the previous. Without previous token heads, induction heads can't function.
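A common diagnostic for the later stages is the prefix-matching score: on a sequence built from a random token block repeated twice, an induction head attends from each position in the second block to the token just after the previous occurrence. A sketch using GPT-2 via plain transformers (the 0.3 threshold is an arbitrary cutoff for "induction-like"):

```python
# Sketch: score every attention head for induction-like behavior on a
# repeated random sequence. At position i in the second block, the induction
# target is position i - T + 1 (one past the previous occurrence).
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

T = 50
block = torch.randint(1000, 20000, (1, T))
tokens = torch.cat([block, block], dim=1)  # random block repeated twice

with torch.no_grad():
    attn = model(tokens, output_attentions=True).attentions  # one per layer

for layer, a in enumerate(attn):
    # a has shape (batch, heads, seq, seq); average each head's attention
    # from second-block positions to their induction targets.
    idx = torch.arange(T, 2 * T - 1)
    scores = a[0, :, idx, idx - T + 1].mean(dim=-1)
    for head, s in enumerate(scores):
        if s > 0.3:
            print(f"layer {layer} head {head}: prefix-matching score {s:.2f}")
```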

Redundancy and Pruning

Training dynamics often show:

6. Practical Considerations

Computational Costs

Challenge: Full training runs are expensive. Choose an experimental design that keeps compute manageable; the options below trade cost against control.

Experimental Design

For your project:

  1. Option A: Fine-tuning study
    • Start with pretrained model (e.g., GPT-2)
    • Fine-tune on dataset emphasizing your concept
    • Track when concept-specific circuit strengthens
    • Faster and cheaper than pretraining from scratch
  2. Option B: Small-scale pretraining
    • Train small model (e.g., 2-layer transformer) from scratch
    • On simplified task involving your concept
    • Complete control over training dynamics
    • Can run multiple seeds for robustness
  3. Option C: Using public checkpoints
    • Pythia suite provides checkpoints at many training steps
    • No training cost—just analysis
    • Limited to what checkpoints are available

7. Connecting to Your Research

Research Questions Enabled by Training Dynamics

Integration with Previous Weeks

Week-by-Week Application:

8. Case Studies

Case 1: Modular Addition Grokking

Task: Learn (a + b) mod p for prime p

Observations:

Mechanistic findings (Nanda et al.):

Prediction: By tracking circuit strengths with ablation, Nanda et al. could anticipate grokking roughly 2,000 epochs before it appeared in validation metrics

Case 2: Induction Head Formation

Task: Language modeling (next-token prediction)

Observations (Olsson et al.):

Mechanistic findings:

Key insight: The phase transition itself is sharp, spanning a window of only ~500M tokens, but hierarchical assembly of the full circuit takes longer

9. Common Patterns and Principles

Empirical Observations

Patterns across multiple studies:

10. Limitations and Open Questions

What We Don't Understand

11. Summary

Key Takeaways

For Your Project

References & Resources

Core Papers

  • Power et al. (2022). "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." arXiv:2201.02177.
  • Nanda et al. (2023). "Progress Measures for Grokking via Mechanistic Interpretability." ICLR 2023. arXiv:2301.05217.
  • Olsson et al. (2022). "In-context Learning and Induction Heads." Transformer Circuits Thread.

Related Work

Tools and Resources

  • Pythia checkpoint suite (EleutherAI): checkpoints for many training steps, published as HuggingFace branches
  • NDIF (National Deep Inference Fabric) and nnsight, used for the in-class OLMo exercise
  • Week 9 Colab notebook (pun emergence exercise, linked below)

In-Class Exercise: When Do Puns Emerge?

Using NDIF (National Deep Inference Fabric) to access OLMo training checkpoints, we will track when pun understanding emerges during pretraining. This connects our pun thread to training dynamics.

Part 1: Setup and Checkpoint Selection (15 min)

Access OLMo checkpoints through NDIF:

  1. Connect to NDIF: Use the provided notebook to access the OLMo checkpoint suite
  2. Select checkpoints: Choose 8-10 checkpoints spanning training:
    • Early: steps 1k, 5k, 10k (before most capabilities emerge)
    • Middle: steps 50k, 100k, 200k (capability formation)
    • Late: steps 500k, 1M, final (mature model)
  3. Prepare pun test set: Load your puns from Week 3 (10-15 examples)
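A loading sketch for step 1 via the plain HuggingFace route (in class we will use NDIF/nnsight for remote access instead). Checkpoint branch names encode both step and token counts and vary between OLMo releases, so enumerate them rather than trusting the illustrative name below:

```python
# Sketch: list and load OLMo checkpoint branches from the HuggingFace Hub.
from huggingface_hub import list_repo_refs
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "allenai/OLMo-1B"

# Each checkpoint is a branch of the model repo; print them to pick 8-10
# spanning early, middle, and late training.
refs = list_repo_refs(MODEL)
print(sorted(b.name for b in refs.branches))

rev = "step10000-tokens41B"  # hypothetical name: use one printed above
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, revision=rev,
    trust_remote_code=True,  # older transformers versions need the hf_olmo package
)
```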

Part 2: Track Pun Capabilities Over Training (25 min)

Measure pun understanding at each checkpoint (a helper for item 1 is sketched after this list):

  1. Pun completion accuracy:
    • For each checkpoint, test: can it complete pun setups correctly?
    • "I used to be a banker, but I lost ___" → does it predict "interest"?
    • Record top-5 predictions and their probabilities
  2. Probe accuracy over training:
    • Apply your Week 6 pun probe to each checkpoint
    • Does pun/non-pun separation improve over training?
    • At which checkpoint does probe accuracy exceed 70%? 90%?
  3. Logit lens at each checkpoint:
    • For a fixed pun, run logit lens at early, middle, and late checkpoints
    • Does the "correct" punchline emerge at earlier layers as training progresses?
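A sketch of a helper for item 1, assuming `model` and `tokenizer` from the Part 1 setup:

```python
# Sketch: top-5 next-token predictions for a pun completion at one checkpoint.
import torch

def top5_completions(model, tokenizer, prompt):
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0, -1]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    top = probs.topk(5)
    return [(tokenizer.decode([int(i)]), p.item())
            for i, p in zip(top.indices, top.values)]

print(top5_completions(model, tokenizer, "I used to be a banker, but I lost"))
# Track when " interest" enters the top-5, and when it becomes the top pick.
```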

Part 3: Identify Phase Transitions (20 min)

Look for sudden changes in pun capability (a plotting sketch follows this list):

  1. Plot capability curves:
    • X-axis: training step (log scale)
    • Y-axis: pun completion accuracy, probe accuracy, or other metric
  2. Identify transitions:
    • Is pun understanding gradual or sudden?
    • If sudden, at what training step does it emerge?
    • Does it correlate with other capability emergence (e.g., general language ability)?
  3. Compare to in-context learning:
    • Test pun completion with and without pun examples in context
    • At early checkpoints, does in-context help more than at late checkpoints?
    • Does ICL for puns emerge at the same time as induction heads?
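A plotting sketch for item 1; the accuracy values below are made-up placeholders for whatever you measured in Part 2:

```python
# Sketch: capability curve over pretraining, with a log-scale x-axis.
import matplotlib.pyplot as plt

results = {1000: 0.0, 5000: 0.05, 50000: 0.10, 200000: 0.60, 500000: 0.70}

steps, acc = zip(*sorted(results.items()))
plt.plot(steps, acc, marker="o")
plt.xscale("log")
plt.xlabel("training step")
plt.ylabel("pun completion accuracy")
plt.title("Pun capability over pretraining")
plt.show()
```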

Discussion: Is pun understanding a distinct capability that emerges at a specific point, or does it improve gradually with general language ability? What does this tell us about how models learn to process humor?

Open Pun Emergence Notebook in Colab

Note: Requires NDIF access for loading OLMo checkpoints.

Project Milestone

Due: Thursday of Week 9

Study when and how your concept's circuit emerges during training. Apply your interpretability methods to multiple training checkpoints to track mechanism formation.

Training Dynamics Study

  • Choose your approach:
    • Option A (Recommended): Use Pythia checkpoints—apply your methods to pre-saved checkpoints at different training steps
    • Option B: Fine-tune your selected model on concept-rich data, saving checkpoints every N steps
    • Option C: Train a small model from scratch on a simplified task involving your concept
  • Save/select checkpoints:
    • Choose 8-15 checkpoints spanning early to late training
    • Log-spaced recommended (e.g., steps 100, 300, 1k, 3k, 10k, 30k, 100k)
    • Ensure you capture pre-emergence and post-emergence states
  • Apply interpretability methods to each checkpoint:
    • From Week 5: Probe accuracy over training—when does concept become linearly accessible?
    • From Week 4: Ablation effects over training—when do components become necessary?
    • From Week 3: Logit lens evolution—when does prediction sharpen?
    • From Week 8: Circuit strength (path patching)—when does composition emerge?
  • Analyze emergence pattern:
    • Is emergence gradual or sudden (phase transition)?
    • Do multiple metrics change simultaneously?
    • What is the timeline of component formation?
    • Is there hierarchical assembly (some parts before others)?

Deliverables:

  • Training trajectory plots:
    • Probe accuracy vs training step
    • Ablation effect size vs training step
    • Key metrics showing mechanism emergence
    • Annotate identified phase transitions or inflection points
  • Emergence analysis:
    • When does your concept's circuit form? (specific training step/token count)
    • Is formation sudden or gradual?
    • Do components form in sequence or simultaneously?
    • Comparison to loss curves—does emergence coincide with capability gain?
  • Mechanistic interpretation:
    • Why does the circuit form when it does?
    • Is there evidence of competing mechanisms (grokking-style)?
    • What does emergence timing reveal about your concept's role?
  • Code: Notebook with checkpoint analysis and visualization

Training dynamics reveal whether your circuit is a "natural" solution that forms reliably, or an accidental feature specific to your model's initialization and training procedure.