Start here: Prologue: At the Edge of Understanding
An essay on why understanding AI matters and what neural mechanics can teach us about the future of human knowledge.
Also see: Lecture Notes | Slides | Finding a Good Research Question
Goal: Study how large language models encode non-CS concepts through interpretability research, culminating in papers suitable for submission to NeurIPS or similar venues.
Team Structure: Interdisciplinary teams of ~3 students—one non-CS PhD student, one CS/ML PhD student, and one Bau Lab member (see "This Week" below for details).
| Weeks | Focus |
|---|---|
| 0-1 | Foundations: Course setup, benchmarking concepts |
| 2-8 | Methods: Core interpretability techniques (steering, circuits, probes, SAEs, validation) |
| 9-11 | Advanced topics: Attribution, skepticism, and research best practices |
| 12 | Final presentations |
This is a research course, not a traditional class.
Grading will be based on:
Success = rigorous investigation of your concept, not necessarily positive results. Negative results with careful validation are publishable!
Large language models process text, but do they think in words? This course begins with a provocation: the internal representations that drive model behavior may be fundamentally different from the tokens that flow in and out.
Words are surface phenomena. They vary across languages, shift meaning with context, and often fail to capture the abstractions that underlie coherent reasoning. Concepts, by contrast, are the invariants—the stable structures that persist across linguistic transformations. This course asks: can we find these invariants inside neural networks?
What an AI knows is not always what it says. This gap between internal representation and external behavior has become starkly visible in recent work.
Rager et al. (2025) study DeepSeek models, revealing a striking case of censorship mechanics. The models demonstrably possess knowledge about sensitive topics—their internal representations encode the relevant concepts—yet they refuse to express this knowledge in their outputs. The information exists inside the model; the suppression is a separate mechanism layered on top.
This dissociation between knowing and saying points to a general phenomenon: models may have internal states that diverge systematically from their outputs. If we want to understand what models actually believe, know, or intend, we cannot rely solely on their words. We must look inside.
The emerging picture suggests that concepts are invariants of the system—internal structures whose functional roles remain unchanged under many transformations. Just as a physical law remains valid regardless of the coordinate system used to express it, a neural concept may persist across languages, phrasings, and contexts.
Examples of concepts that appear to have this invariant character include factual associations (where the Eiffel Tower stands), abstract task functions ("find the antonym"), and representations of other agents' mental states—each taken up in the work discussed below.
These are not isolated neurons but distributed patterns of activation. They are not explicitly labeled in the training data but emerge from the structure of the task. And they are not words—they are what the words point to.
How do we identify which internal components are responsible for specific behaviors? One methodology is causal mediation analysis: intervening on internal components to establish their causal role in the model's outputs.
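To make this concrete, here is a minimal sketch of the intervene-and-measure logic using activation patching on GPT-2 small (via Hugging Face transformers). The prompts, the choice of layer 6, and the last-token position are illustrative assumptions, not a recipe from any particular paper:

```python
# Minimal activation-patching sketch: cache a hidden state from a "clean" run,
# splice it into a "corrupted" run, and watch the logit of a target token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

clean = "The capital of France is"
corrupt = "The capital of Italy is"
target = tok(" Paris")["input_ids"][0]   # token id whose logit we track

LAYER, POS = 6, -1                        # assumed layer and position to patch

def run(prompt, hook=None):
    handle = model.transformer.h[LAYER].register_forward_hook(hook) if hook else None
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits
    if handle:
        handle.remove()
    return logits[0, -1, target].item()

cache = {}
def save_hook(module, inputs, output):    # cache the clean hidden state
    cache["h"] = output[0][:, POS, :].detach().clone()

def patch_hook(module, inputs, output):   # splice it into the corrupted run
    hidden = output[0].clone()
    hidden[:, POS, :] = cache["h"]
    return (hidden,) + output[1:]

run(clean, save_hook)                     # 1) record the activation on the clean prompt
before = run(corrupt)                     # 2) baseline logit on the corrupted prompt
after = run(corrupt, patch_hook)          # 3) corrupted prompt + patched activation
print(f"logit(' Paris') corrupted: {before:.2f}  patched: {after:.2f}")
```

If the patched logit moves toward the clean prompt's behavior, the patched component carries information that mediates the output—the core inference of causal mediation analysis.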
Early work on GANs (Generative Adversarial Networks) provides an instructive template. A StyleGAN trained to generate bedrooms—with no supervision beyond imitation of a few million bedroom images—revealed striking internal structure.
Experiment (Bau et al., 2019): Dissect the generator's internal units, then intervene directly—turning individual units up or down—and watch how the generated bedrooms change.
Result: A single neuron turns lights on and off across many different images.
This is surprising because there is nothing in the training data explicitly about switching lights on and off. The training set contains only static bedroom images—no before/after pairs, no labels about lighting. Yet "switching off the lights" emerges as a concept that the neural network learned from the training process.
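The mechanics of such an intervention are simple to sketch with PyTorch forward hooks. The generator below is a small untrained stand-in (loading a real pretrained bedroom GAN is omitted for brevity), and the layer, unit index, and scale are arbitrary placeholders; in the actual experiment these are chosen by dissecting a trained generator:

```python
# Sketch of editing a single internal unit of a convolutional generator,
# in the spirit of GAN Dissection. Stand-in model; indices are placeholders.
import torch
import torch.nn as nn

generator = nn.Sequential(                      # stand-in for a pretrained generator
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),
)
LAYER, UNIT, SCALE = 2, 17, 0.0                 # which feature map to edit, and how

def edit_unit(module, inputs, output):
    out = output.clone()
    out[:, UNIT] = out[:, UNIT] * SCALE          # ablate (SCALE=0) or boost (SCALE>1) one channel
    return out

z = torch.randn(1, 128, 4, 4)                    # latent input
with torch.no_grad():
    original = generator(z)
    handle = generator[LAYER].register_forward_hook(edit_unit)
    edited = generator(z)
    handle.remove()

print("pixel-space change from editing one unit:", (edited - original).abs().mean().item())
```

With a trained bedroom generator, the same edit applied to the "lamp" unit turns the lights off in every generated scene—one internal unit, one coherent visual concept.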
This leads to the central question of this course: When we train on much more complex data—text spanning all of human knowledge—what concepts do neural networks learn?
In 2020, GPT-3 demonstrated in-context learning (ICL): the ability to learn new tasks from a few examples in the prompt, without any gradient updates.
Example: Translation without training
Prompt:
English: Hello → French: Bonjour
English: Goodbye → French: Au revoir
English: Thank you → French: [model generates: Merci]
This is a form of metareasoning—the model is not just processing text but reasoning about how to reason. Where does this capability come from? It is not explicitly programmed. It emerges from training on next-token prediction, yet manifests as something that looks remarkably like flexible cognition.
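Here is a minimal sketch of reproducing the few-shot prompt above, assuming the Hugging Face transformers library. GPT-2 small is used only because it is quick to download and may well fail the task; larger models complete the pattern reliably, which is the point of in-context learning:

```python
# Probe in-context learning with a few-shot translation prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = (
    "English: Hello → French: Bonjour\n"
    "English: Goodbye → French: Au revoir\n"
    "English: Thank you → French:"
)
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=3, do_sample=False,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:]))   # ideally: " Merci"
```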
Where is in-context learning encoded inside the model? Research by Todd et al. (2024) reveals a striking answer: when a model performs ICL, it develops compact vector representations of the task itself—function vectors—that can be extracted and transplanted to new contexts.
Experiment (Todd et al., 2024): Run the model on in-context prompts of antonym pairs (hot → cold, big → small, ...), extract a compact function vector summarizing the task from the internal activations, then add that vector to the model's hidden states while it processes a new prompt containing only the word "fast".
Result: The model outputs "slow"—the antonym—even though no antonym examples appeared in this new prompt.
This is directly analogous to the lamp-controlling neuron, but for abstract functions rather than visual features. Just as the lamp neuron's causal effect transfers across different bedroom images, function vectors transfer across completely different textual contexts. The neural representation learned in one setting causes the same function to execute in another.
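A crude sketch of the extract-and-transplant idea appears below. The published method averages the outputs of specific attention heads selected by causal analysis; for brevity this sketch averages a whole-layer hidden state instead, and the model, layer, scale, and prompts are all simplifying assumptions (a small model may not succeed):

```python
# Crude "function vector": average a layer's last-token hidden state over
# antonym ICL prompts, then inject it into an unrelated zero-shot prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
LAYER, SCALE = 8, 4.0                            # assumed injection layer and strength

icl_prompts = [
    "hot → cold\nbig → small\nup →",
    "wet → dry\nfast → slow\ntall →",
    "good → bad\nhard → soft\nlight →",
]

def hidden_at_last_token(prompt):
    with torch.no_grad():
        hs = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]                      # residual stream after LAYER blocks

fv = torch.stack([hidden_at_last_token(p) for p in icl_prompts]).mean(0)

def add_fv(module, inputs, output):              # inject the task vector at the last position
    hidden = output[0].clone()
    hidden[:, -1, :] += SCALE * fv
    return (hidden,) + output[1:]

zero_shot = "fast →"
handle = model.transformer.h[LAYER - 1].register_forward_hook(add_fv)
with torch.no_grad():
    logits = model(**tok(zero_shot, return_tensors="pt")).logits
handle.remove()
print(tok.decode(logits[0, -1].argmax().item())) # with luck: " slow"
```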
A Key Finding: Function vectors work across languages. An antonym function vector extracted from English examples can induce antonym behavior on French or German inputs. The internal representation of "find the opposite" is not tied to any particular language—it is an abstract structure that transcends the tokens.
This reinforces the central thesis: concepts inside LLMs are not words. They are abstract structures—stable under linguistic transformations—that we can study, characterize, and manipulate.
The methodology of causal mediation generalizes from GANs to language models. Three lines of work illustrate the progress that has been made in localizing and characterizing neural concepts:
Locating Factual Knowledge (Meng et al., 2022): The ROME paper demonstrated that factual associations in GPT-style models are localized in specific MLP layers. By applying causal tracing—corrupting inputs and restoring activations at targeted locations—researchers identified where facts like "The Eiffel Tower is in Paris" are stored. This established that factual knowledge has a discernible address inside the network (a code sketch of this corrupt-and-restore procedure appears after "The Pattern" below).
Locating Functions (Todd et al., 2024): This work extended localization from facts to functions. When a model performs in-context learning of an abstract task—translating to French, answering with antonyms, continuing a pattern—where is that function encoded? Function vectors show that task-level abstractions are localizable and, remarkably, transferable: a vector extracted from one context can induce the same function in another.
Locating Mental State Representations (Prakash et al., 2025): This work examined how models represent and track the mental states of agents described in text—theory of mind. It shows that interpretability methods can reach beyond simple factual recall into the representation of genuinely abstract, cognitively rich concepts.
The Pattern: From facts to functions to mental states—each study finds that abstract concepts have neural correlates that can be identified, validated, and manipulated. The methods you will learn in this course are the tools for this kind of discovery.
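Here is a minimal sketch of the corrupt-and-restore procedure behind causal tracing, again using GPT-2 small for convenience. The noise scale, the assumed subject token span, and the layers scanned are illustrative choices; a real analysis averages over many noise samples and sweeps every layer and position:

```python
# ROME-style causal tracing sketch: noise the subject's embeddings, then restore
# the clean hidden state at one (layer, token) and measure how much recovers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is in the city of"
target = tok(" Paris")["input_ids"][0]
inputs = tok(prompt, return_tensors="pt")
SUBJ = slice(1, 5)            # approximate token span of "Eiffel Tower" (an assumption)
TRACE_POS = 4                 # last subject token (assumed index)

with torch.no_grad():         # clean run: cache hidden states at every layer
    clean_hs = model(**inputs, output_hidden_states=True).hidden_states

def corrupt_embeddings(module, inp, output):
    noisy = output.clone()
    noisy[:, SUBJ, :] += 3.0 * torch.randn_like(noisy[:, SUBJ, :])
    return noisy

def make_restore(layer):
    def restore(module, inp, output):
        hidden = output[0].clone()
        hidden[:, TRACE_POS, :] = clean_hs[layer + 1][:, TRACE_POS, :]
        return (hidden,) + output[1:]
    return restore

torch.manual_seed(0)
h1 = model.transformer.wte.register_forward_hook(corrupt_embeddings)
with torch.no_grad():
    corrupted = model(**inputs).logits[0, -1, target].item()
h1.remove()
print(f"corrupted, no restoration: logit(' Paris') = {corrupted:.2f}")

for layer in [0, 4, 8, 11]:   # scan a few layers for where restoration helps most
    h1 = model.transformer.wte.register_forward_hook(corrupt_embeddings)
    h2 = model.transformer.h[layer].register_forward_hook(make_restore(layer))
    with torch.no_grad():
        logit = model(**inputs).logits[0, -1, target].item()
    h1.remove(); h2.remove()
    print(f"restore at layer {layer:2d}: logit(' Paris') = {logit:.2f}")
```

Layers where restoration recovers the correct prediction are candidate "addresses" for the fact.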
If concepts inside LLMs are real, discoverable, and causally important, then we should map them systematically. The vision motivating this course is the creation of an atlas of neural concepts: a comprehensive characterization of the internal structures that drive model behavior across domains.
Such an atlas would serve multiple purposes:
Building this atlas requires methods. The works surveyed above—and the techniques you will learn throughout this course—constitute the toolkit for this cartographic project.
What is the right unit of analysis for such an atlas? This is an open question. Possibilities include individual neurons, linear directions in activation space, sparse autoencoder features, and multi-component circuits.
The right level of description likely depends on the question being asked. Your projects will contribute to answering this.
Before you invest months of effort into a research project, you should ask: is this the right question? The FINER framework—adapted from clinical research methodology—offers useful criteria: a good research question should be Feasible, Interesting, Novel, Ethical, and Relevant.
For a detailed discussion, see Finding a Good Research Question.
| Criterion | Key Questions |
|---|---|
| Feasible | Can you find a model where the phenomenon appears and where you can see inside? Are there signs of life in the internal representations? |
| Interesting | Does this question excite you? Would your colleagues (in ML and in your domain) consider it important? More interesting concepts tend to generalize—providing insights across many phenomena rather than explaining just one narrow case. |
| Novel | Has this been done before? Domain-specific concepts are vastly understudied—this is your opportunity. |
| Ethical | Would this work primarily enable harm? Interpretability tools can be misused. |
| Relevant | If you answer this question, what changes for the field? For the domains where models are deployed? |
LLMs are being deployed in medicine, law, education, scientific research, policy analysis. The highest-stakes applications are almost entirely outside computer science. Yet interpretability research remains focused on questions that matter primarily to ML researchers.
This is the gap. This course is forming interdisciplinary teams precisely to address it. The interpretability community has strong intuitions about which questions matter for AI safety and for understanding deep learning. But we have weak intuitions about which questions matter for medicine, for law, for scientific discovery. Your non-CS collaborators have those intuitions.
When the legal scholar on your team gets excited about a research direction, that signal contains information you could not generate yourself. When the biologist identifies a concept that would matter for scientific discovery, that is the kind of question no one else is asking.
Biology: Evolutionary relationships, protein structure encoding, ecological relationships
Linguistics: Politeness strategies, evidentiality, discourse structure
Psychology: Theory of mind, temporal reasoning, causal attribution
Physics/Chemistry: Conservation laws, reaction mechanisms, spatial reasoning
Law: Legal precedent, burden of proof, jurisdiction
Music: Harmonic relationships, rhythmic patterns, musical form
In your teams, discuss:
Part 1: Form Teams
Part 2: Write Your Pitch Document (1-2 pages)
Create a Google Doc in your team's Drive folder. Your pitch should include:
Part 3: Prepare Your Elevator Pitch
Prepare a 5-minute presentation of your idea. In Week 1, each team will pitch to the class, and we will discuss each proposal using the FINER framework.
Submission: Upload your pitch document to your team Google Drive before the Thursday class (Jan 15). We will provide feedback during your presentation.
Week 1: Foundations—logit lens, intermediate representations, and the vocabulary of mechanistic interpretability
Week 2: Steering—controlling model behavior through activation addition and representation engineering
Week 3: Evaluation Methodology—measuring and evaluating LLM behavior
Week 4: Representation Geometry—PCA visualization, linear directions, and geometric structure
Weeks 5-8: Advanced methods (causal localization, probes, attribution, circuits)
Weeks 9-10: Training dynamics, model editing, and self-description
Weeks 11-13: Paper writing workshops and final presentations
By the end: You'll have a complete research pipeline for characterizing YOUR concept in LLMs, culminating in a paper suitable for submission to a top venue.
Reach out to the teaching team:
Office hours: [See course website]
This Week: Form Teams and Begin Your Pitch
Form interdisciplinary teams of approximately 3 members: one non-CS PhD student (bringing domain expertise), one CS/ML PhD student (bringing technical ML background), and one Bau Lab member (bringing interpretability expertise).
Begin brainstorming concepts from your domain that might be interesting to study in LLMs. The goal is to identify 2-3 candidate concepts that hold up against the FINER criteria above: feasible, interesting, novel, ethical, and relevant.
The pitch document and presentation are due Thursday of Week 1. Use this week to form your team, explore ideas, and begin writing.