Northeastern University, Spring 2026
Modern AI systems are powerful but opaque: even their creators do not fully understand what the billions of artificial neurons are doing inside. This class teaches methods for probing neural networks to uncover what concepts they have learned and where those concepts are encoded.
Each interdisciplinary team, made up of a domain-expert PhD student, one or more CS/ML graduate students, and a Bau Lab mentor, will aim to produce a publication-quality research paper on how LLMs encode concepts from fields outside CS, an area often neglected in interpretability research. Teams will target venues like NeurIPS, ICLR, or ICML. Prerequisites: linear algebra, probability, deep learning, Python, PyTorch.
Time: Tuesday 11:45 am – 1:25 pm, Thursday 2:50 pm – 4:30 pm
Location: Hayden Hall 321
| Week | Tuesday | Thursday | Assignments / Notes |
|---|---|---|---|
| 0 | | Introduction: Course structure, goals, and introduction to mechanistic interpretability. Thu Jan 8 | Form teams; establish team Google Drive; brainstorm 2-3 candidate concepts |
| 1 | Foundations: Logit lens, intermediate representations, and the vocabulary of mechanistic interpretability. Tue Jan 13. Read: Primer, Logit Lens, Latent Language | Project Pitches Thu Jan 15 | Hand in pitch Google Doc; each team presents their pitch |
| 2 | Steering: Controlling model behavior by manipulating distributed neural representation vectors. Tue Jan 20. Read: Piantadosi, Superposition, ITI | Project Updates Thu Jan 22 | First plot-a-thon slides; logit lens or Neuronpedia explorations |
| 3 | Evaluation Methodology: Creating evaluation datasets; measuring LLM behavior with cloze tasks, LLM-as-judge, model-written evaluations. Tue Jan 27. Read: Model-Written Evals, LLM-as-Judge, LAMA | Project Updates Thu Jan 29 | Plot-a-thon: benchmark findings and model choices; create GitHub |
| 4 | Representation Geometry: What does a concept look like? PCA visualization, linear directions, and geometric structure. Tue Feb 3. Read: Geometry of Truth, Sentiment, Vector Arithmetic | Project Updates Thu Feb 5 | Push interactive visualizations to project website |
| 5 | Causal Localization: Where are facts and functions computed? Causal tracing, activation patching, and function vectors. Tue Feb 10. Read: ROME, Function Vectors, Entity Tracking | Project Updates Thu Feb 12 | Start Overleaf; causal mediation experiments |
| 6 | Probes: Is the information there? Training classifiers to decode concepts, plus methodological pitfalls. Tue Feb 17. Read: TCAV, Probing Control Tasks | Project Updates Thu Feb 19 | Reproducible experiment infrastructure; trained concept probe |
| 7 | Attribution: Which input tokens matter? Integrated gradients, faithful attribution, and RAG analysis. Tue Feb 24. Read: Integrated Gradients, MIRAGE | Project Updates Thu Feb 26 | Draft intro framing your research story; input attribution |
| — | Spring Break Mar 2–6 | No class | |
| 8 | Circuits: Reverse-engineering end-to-end algorithms: induction heads, ACDC, and automated circuit discovery. Tue Mar 10. Read: Induction Heads, ACDC, Faithfulness | Project Updates Thu Mar 12 | Refine hypotheses; repeat experiments; circuits analysis |
| 9 | Training Dynamics & Model Editing: How circuits emerge during training; surgical fact editing with MEMIT. Tue Mar 17. Read: Grokking, MEMIT, In-Context Algebra | Project Updates Thu Mar 19 | Triangulate: scale or diversity experiments |
| 10 | Human Understanding & Self-Description: Can interpretability help humans? Can models interpret themselves? Tue Mar 24. Read: Bridging the Human-AI Gap, Patchscopes, Neologism | Project Updates Thu Mar 26 | More triangulation; finalize experiment results |
| 11 | How to Write a Paper Tue Mar 31 | Peer Review Workshop Thu Apr 2 | Introduction + methods draft for peer review |
| 12 | Plot-a-thon: Results Discussion Tue Apr 7 | Peer Review Workshop Thu Apr 9 | Complete paper draft for peer review |
| 13 | Guest Lecture Tue Apr 14 | Paper Editing Workshop Thu Apr 16 | Editing and refinement |
| 14 | Final Presentations Tue Apr 21 | Final paper due Wed Apr 22 | |
| Name | Day | Time | Location |
|---|---|---|---|
| David | TBD | TBD | TBD |
| Nikhil | TBD | TBD | TBD |