Week 0: Introduction & Course Overview

Start here: Prologue: At the Edge of Understanding

An essay on why understanding AI matters and what neural mechanics can teach us about the future of human knowledge.

Also see: Lecture Notes | Slides | Finding a Good Research Question

Learning Objectives

1. Course Logistics

Course Overview

Goal: Study how large language models encode non-CS concepts through interpretability research, culminating in papers suitable for submission to NeurIPS or similar venues.

Team Structure: Interdisciplinary teams of ~3 students: one non-CS PhD student (domain expertise), one CS/ML PhD student (technical ML background), and one Bau Lab member (interpretability expertise).

Course Structure

| Weeks | Focus |
|-------|-------|
| 0-1 | Foundations: Course setup, benchmarking concepts |
| 2-8 | Methods: Core interpretability techniques (steering, circuits, probes, SAEs, validation) |
| 9-11 | Advanced topics: Attribution, skepticism, and research best practices |
| 12 | Final presentations |

Grading & Expectations

This is a research course, not a traditional class.

Grading will be based on:

Success = rigorous investigation of your concept, not necessarily positive results. Negative results with careful validation are publishable!

2. The Central Mystery: Are Concepts Different from Words?

Large language models process text, but do they think in words? This course begins with a provocation: the internal representations that drive model behavior may be fundamentally different from the tokens that flow in and out.

Words are surface phenomena. They vary across languages, shift meaning with context, and often fail to capture the abstractions that underlie coherent reasoning. Concepts, by contrast, are the invariants—the stable structures that persist across linguistic transformations. This course asks: can we find these invariants inside neural networks?

The Gap Between Knowledge and Expression

What an AI knows is not always what it says. This gap between internal representation and external behavior has become starkly visible in recent work.

Rager et al. (2025) study DeepSeek models, revealing a striking case of censorship mechanics. The models demonstrably possess knowledge about sensitive topics—their internal representations encode the relevant concepts—yet they refuse to express this knowledge in their outputs. The information exists inside the model; the suppression is a separate mechanism layered on top.

This dissociation between knowing and saying points to a general phenomenon: models may have internal states that diverge systematically from their outputs. If we want to understand what models actually believe, know, or intend, we cannot rely solely on their words. We must look inside.

Concepts as Invariants

The emerging picture suggests that concepts are invariants of the system—internal structures whose functional roles remain unchanged under many transformations. Just as a physical law remains valid regardless of the coordinate system used to express it, a neural concept may persist across languages, phrasings, and contexts.

Examples of concepts that appear to have this invariant character:

These are not isolated neurons but distributed patterns of activation. They are not explicitly labeled in the training data but emerge from the structure of the task. And they are not words—they are what the words point to.

3. Causal Mediation: Finding the Neurons That Matter

How do we identify which internal components are responsible for specific behaviors? One methodology is causal mediation analysis: intervening on internal components to establish their causal role in the model's outputs.

Early Success: The Lamp-Controlling Neuron

Early work on GANs (Generative Adversarial Networks) provides an instructive template. A StyleGAN trained to generate bedrooms—with no supervision beyond imitation of a few million bedroom images—revealed striking internal structure.

Experiment (Bau et al., 2019):

  1. Generate one image that has a lamp, using a fixed random seed z, and manually mark the pixels illuminated by the lamp light
  2. For each of the 9,000 style neurons, regenerate the same image (same seed z) with that neuron removed (set to zero), and rank neurons by their causal effect on the marked pixels
  3. Test the highest-ranked neuron for its effect on other bedroom images

Result: A single neuron turns lights on and off across many different images.

This is surprising because there is nothing in the training data explicitly about switching lights on and off. The training set contains only static bedroom images—no before/after pairs, no labels about lighting. Yet "switching off the lights" emerges as a concept that the neural network learned from the training process.
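The ranking step can be sketched in a few lines. The following is a minimal illustration, not the authors' code: `generate(z, zero_units=...)` is an assumed wrapper around the StyleGAN generator, and `lamp_mask` is the hand-marked boolean pixel mask from step 1.

```python
import numpy as np

def rank_units_by_lamp_effect(generate, z, lamp_mask, n_units=9000):
    """Rank style units by how much zeroing each one changes the marked lamp pixels."""
    baseline = generate(z)                      # image with the lamp lit (fixed seed z)
    effects = np.zeros(n_units)
    for u in range(n_units):
        ablated = generate(z, zero_units=[u])   # same seed z, unit u set to zero
        # Causal effect: mean absolute pixel change inside the hand-marked lamp region
        effects[u] = np.abs(baseline - ablated)[lamp_mask].mean()
    return np.argsort(effects)[::-1]            # highest-effect units first

```

The top-ranked unit is then re-tested on other bedroom images (step 3) to confirm that its effect generalizes beyond the image used for ranking.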

This leads to the central question of this course: When we train on much more complex data—text spanning all of human knowledge—what concepts do neural networks learn?

4. In-Context Learning and Function Vectors

In-Context Learning: A Form of Metareasoning

In 2020, GPT-3 demonstrated in-context learning (ICL): the ability to learn new tasks from a few examples in the prompt, without any gradient updates.

Example: Translation without training

Prompt:
English: Hello → French: Bonjour
English: Goodbye → French: Au revoir
English: Thank you → French: [model generates: Merci]

This is a form of metareasoning—the model is not just processing text but reasoning about how to reason. Where does this capability come from? It is not explicitly programmed. It emerges from training on next-token prediction, yet manifests as something that looks remarkably like flexible cognition.
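Nothing changes inside the model during this process; the "learning" lives entirely in the prompt. A minimal sketch, assuming a hypothetical `complete(prompt)` wrapper around whatever LLM API you use:

```python
def translate_via_icl(complete, word):
    """Few-shot translation with no gradient updates: the task is specified by examples."""
    prompt = (
        "English: Hello -> French: Bonjour\n"
        "English: Goodbye -> French: Au revoir\n"
        f"English: {word} -> French:"
    )
    return complete(prompt).strip()  # e.g., "Merci" for word="Thank you"
```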

Function Vectors: Portable Neural Representations of Tasks

Where is in-context learning encoded inside the model? Research by Todd et al. (2024) reveals a striking answer: when a model performs ICL, it develops compact vector representations of the task itself—function vectors—that can be extracted and transplanted to new contexts.

Experiment (Todd et al., 2024):

  1. Give the model in-context examples demonstrating a function (e.g., antonyms: "arrive:depart, small:big, hot:cold")
  2. Extract the activation pattern from attention head outputs—this is the "function vector" for antonyms
  3. Insert this vector into a completely different context: "The word 'fast' means"

Result: The model outputs "slow"—the antonym—even though no antonym examples appeared in this new prompt.
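A hedged sketch of the extract-and-transplant recipe, using assumed helper functions (`head_outputs_at_last_token` and `generate_with_residual_addition`) in place of any particular hooking library; this is an illustration of the idea, not Todd et al.'s implementation:

```python
import torch

def extract_function_vector(head_outputs_at_last_token, icl_prompts, important_heads):
    """Average the outputs of selected attention heads at the final token across
    many antonym ICL prompts; the mean of their sum is the function vector."""
    vecs = []
    for prompt in icl_prompts:
        outs = head_outputs_at_last_token(prompt)          # {(layer, head): tensor[d_model]}
        vecs.append(torch.stack([outs[h] for h in important_heads]).sum(dim=0))
    return torch.stack(vecs).mean(dim=0)

def apply_function_vector(generate_with_residual_addition, fv, layer, prompt):
    """Add the function vector into the residual stream at `layer` while decoding
    a new zero-shot prompt (e.g., The word 'fast' means ...)."""
    return generate_with_residual_addition(prompt, vector=fv, layer=layer)
```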

This is directly analogous to the lamp-controlling neuron, but for abstract functions rather than visual features. Just as the lamp neuron's causal effect transfers across different bedroom images, function vectors transfer across completely different textual contexts. The neural representation learned in one setting causes the same function to execute in another.

A Key Finding: Function vectors work across languages. An antonym function vector extracted from English examples can induce antonym behavior on French or German inputs. The internal representation of "find the opposite" is not tied to any particular language—it is an abstract structure that transcends the tokens.

This reinforces the central thesis: concepts inside LLMs are not words. They are abstract structures—stable under linguistic transformations—that we can study, characterize, and manipulate.

Key Research Threads in LLMs

The methodology of causal mediation generalizes from GANs to language models. Three lines of work illustrate the progress that has been made in localizing and characterizing neural concepts:

Locating Factual Knowledge (Meng et al., 2022): The ROME paper demonstrated that factual associations in GPT-style models are localized in specific MLP layers. By applying causal tracing—corrupting inputs and restoring activations at targeted locations—researchers identified where facts like "The Eiffel Tower is in Paris" are stored. This established that factual knowledge has a discernible address inside the network.
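A simplified sketch of the causal-tracing loop, assuming two illustrative runner functions (`run_clean`, which returns cached hidden states plus the answer probability, and `run_with_noised_subject`, which corrupts the subject token embeddings and can optionally restore one clean state); the interface is hypothetical, not a specific library's API:

```python
def causal_trace(run_clean, run_with_noised_subject, prompt, answer, layers, positions):
    """Score each (layer, position) by how much restoring its clean hidden state
    recovers the correct answer after the subject tokens have been corrupted."""
    clean_states, clean_p = run_clean(prompt, answer)            # cached states + P(answer | clean)
    _, corrupted_p = run_with_noised_subject(prompt, answer)     # P(answer | noised subject)
    scores = {}
    for layer in layers:
        for pos in positions:
            _, restored_p = run_with_noised_subject(
                prompt, answer, restore=(layer, pos, clean_states[layer][pos])
            )
            # Fraction of the lost probability recovered by patching this one clean state
            scores[(layer, pos)] = (restored_p - corrupted_p) / (clean_p - corrupted_p)
    return scores
```

Locations with large recovered probability are the candidate "addresses" for the fact.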

Locating Functions (Todd et al., 2024): This work extended localization from facts to functions. When a model performs in-context learning of an abstract task—translating to French, answering with antonyms, continuing a pattern—where is that function encoded? Function vectors show that task-level abstractions are localizable and, remarkably, transferable: a vector extracted from one context can induce the same function in another.

Locating Mental State Representations (Prakash et al., 2025): This work examined how models represent and track the mental states of agents described in text—theory of mind. It shows that interpretability methods can reach beyond simple factual recall into the representation of genuinely abstract, cognitively rich concepts.

The Pattern: From facts to functions to mental states—each study finds that abstract concepts have neural correlates that can be identified, validated, and manipulated. The methods you will learn in this course are the tools for this kind of discovery.

5. Toward an Atlas of Neural Concepts

If concepts inside LLMs are real, discoverable, and causally important, then we should map them systematically. The vision motivating this course is the creation of an atlas of neural concepts: a comprehensive characterization of the internal structures that drive model behavior across domains.

Why Build an Atlas?

Such an atlas would serve multiple purposes:

Building this atlas requires methods. The works surveyed above—and the techniques you will learn throughout this course—constitute the toolkit for this cartographic project.

What Should the Atlas Contain?

This is an open question. Possibilities include:

The right level of description likely depends on the question being asked. Your projects will contribute to answering this.

Discussion Questions

  1. What do interpretability methods have in common? Causal tracing, function vectors, circuit analysis—each uses different techniques. What makes something a "mechanistic interpretability" approach?
  2. How do we know when we have found a concept? What criteria distinguish a genuine neural concept from a spurious correlation or an artifact of our analysis methods?
  3. What concepts matter most? Given limited research resources, which concepts should we prioritize? What makes a concept important from a scientific perspective? From a safety perspective?
  4. Can models describe their own concepts? If concepts are not words, can language models ever articulate what they internally represent? Or is there a fundamental gap between neural representation and linguistic expression?

6. What Makes a Good Research Question?

Before you invest months of effort into a research project, you should ask: is this the right question? The FINER framework—adapted from clinical research methodology—offers useful criteria: a good research question should be Feasible, Interesting, Novel, Ethical, and Relevant.

For a detailed discussion, see Finding a Good Research Question.

The FINER Framework for Interpretability

| Criterion | Key Questions |
|-----------|---------------|
| Feasible | Can you find a model where the phenomenon appears and where you can see inside? Are there signs of life in the internal representations? |
| Interesting | Does this question excite you? Would your colleagues (in ML and in your domain) consider it important? More interesting concepts tend to generalize—providing insights across many phenomena rather than explaining just one narrow case. |
| Novel | Has this been done before? Domain-specific concepts are vastly understudied—this is your opportunity. |
| Ethical | Would this work primarily enable harm? Interpretability tools can be misused. |
| Relevant | If you answer this question, what changes for the field? For the domains where models are deployed? |

The Interdisciplinary Advantage

LLMs are being deployed in medicine, law, education, scientific research, and policy analysis. The highest-stakes applications are almost entirely outside computer science. Yet interpretability research remains focused on questions that matter primarily to ML researchers.

This is the gap. This course is forming interdisciplinary teams precisely to address it. The interpretability community has strong intuitions about which questions matter for AI safety and for understanding deep learning. But we have weak intuitions about which questions matter for medicine, for law, for scientific discovery. Your non-CS collaborators have those intuitions.

When the legal scholar on your team gets excited about a research direction, that signal contains information you could not generate yourself. When the biologist identifies a concept that would matter for scientific discovery, that is the kind of question no one else is asking.

Example Project Ideas

Biology: Evolutionary relationships, protein structure encoding, ecological relationships

Linguistics: Politeness strategies, evidentiality, discourse structure

Psychology: Theory of mind, temporal reasoning, causal attribution

Physics/Chemistry: Conservation laws, reaction mechanisms, spatial reasoning

Law: Legal precedent, burden of proof, jurisdiction

Music: Harmonic relationships, rhythmic patterns, musical form

Group Discussion

In your teams, discuss:

  1. What is a fundamental concept in your domain that non-experts often misunderstand?
  2. How would you test if an LLM understands this concept?
  3. What would it mean to "localize" this concept in a neural network?
  4. What failure modes might be interesting? (When does the model get it wrong?)
  5. How could this research benefit your field or AI safety?

7. Assignment: Initial Pitch (Due Week 1)

Part 1: Form Teams

Part 2: Write Your Pitch Document (1-2 pages)

Create a Google Doc in your team's Drive folder. Your pitch should include:

Part 3: Prepare Your Elevator Pitch

Prepare a 5-minute presentation of your idea. In Week 1, each team will pitch to the class, and we will discuss each proposal using the FINER framework.

Submission: Upload your pitch document to your team Google Drive before the Thursday class (Jan 15). We will provide feedback during your presentation.

8. Looking Ahead

Week 1: Foundations—logit lens, intermediate representations, and the vocabulary of mechanistic interpretability

Week 2: Steering—controlling model behavior through activation addition and representation engineering

Week 3: Evaluation Methodology—measuring and evaluating LLM behavior

Week 4: Representation Geometry—PCA visualization, linear directions, and geometric structure

Weeks 5-8: Advanced methods (causal localization, probes, attribution, circuits)

Weeks 9-10: Training dynamics, model editing, and self-description

Weeks 11-13: Paper writing workshops and final presentations

By the end: You'll have a complete research pipeline for characterizing YOUR concept in LLMs, culminating in a paper suitable for submission to a top venue.

Questions?

Reach out to the teaching team:

Office hours: [See course website]

References & Further Reading

On Finding Good Research Questions

Interpretability Papers Mentioned

Accessible Introductions

Supplementary Reading

Project Milestone

This Week: Form Teams and Begin Your Pitch

Form interdisciplinary teams of approximately 3 members: one non-CS PhD student (bringing domain expertise), one CS/ML PhD student (bringing technical ML background), and one Bau Lab member (bringing interpretability expertise).

Begin brainstorming concepts from your domain that might be interesting to study in LLMs. The goal is to identify 2-3 candidate concepts that are:

Getting Started:

The pitch document and presentation are due Thursday of Week 1. Use this week to form your team, explore ideas, and begin writing.