
Neel Nanda on mechanistic interpretability

Neel Nanda is a researcher at Google DeepMind working on mechanistic interpretability. He is also known for his YouTube channel where he explains what is going on inside of neural networks to a large audience.

In this conversation, we discuss what mechanistic interpretability is, how Neel got into it, his research methodology, and his advice for people who want to get started, as well as papers on superposition, toy models of universality, and grokking, among other things.

(Note: as always, the conversation is ~2h long, so feel free to click on any sub-topic you like in the outline below and then come back to the outline by clicking on the green arrow.)

Contents

Highlighted Quotes

(See the LessWrong post for discussion.)

An Informal Definition Of Mechanistic Interpretability

It’s kind of this weird flavor of AI interpretability that says, “Bold hypothesis. Despite the entire edifice of established wisdom in machine learning saying that these models are bullshit, inscrutable black boxes, I’m going to assume there is some actual structure here. But the structure is not there because the model wants to be interpretable or because it wants to be nice to me. The structure is there because the model learns an algorithm, and the algorithms that are most natural to express in the model’s structure and its particular architecture and stack of linear algebra are algorithms that make sense to humans.” (full context)

Three Modes Of Mechanistic Interpretability Research: Confirming, Red Teaming And Gaining Surface Area

I kind of feel a lot of my research style is dominated by this deep seated conviction that models are comprehensible and that everything is fundamentally kind of obvious and that I should be able to just go inside the model and there should be this internal structure. And so one mode of research is I just have all of these hypotheses and guesses about what’s going on. I generate experiment ideas for things that should be true if my hypothesis is true. And I just repeatedly try to confirm it.

Another mode of research is trying to red team and break things, where I have this hypothesis, I do this experiment, I’m like, “oh my God, this is going so well”, and then get kind of stressed because I’m concerned that I’m having wishful thinking and I try to break it and falsify it and come up with experiments that would show that actually life is complicated.

A third mode of research is what I call “trying to gain surface area” where I just have a system that I’m pretty confused about. I just don’t really know where to get started. Often, I’ll just go and do things that I think will get me more information. Just go and plot stuff or follow random things I’m curious about in a fairly undirected fuzzy way. This mode of research has actually been the most productive for me. […]

You could paraphrase them as, “Isn’t it really obvious what’s going on?”, “Oh man, am I so sure about this?” and “Fuck around and find out”. (full context)

Strong Beliefs Weakly Held: Having Hypotheses But Being Willing To Be Surprised

You can kind of think of it as “strong beliefs weakly held”. I think you should be good enough that you can start to form hypotheses, being at the point where you can sit down, set a five minute timer and brainstorm what’s going on and come up with four different hypotheses is just a much, much stronger research position than when you sit down and try to brainstorm and you come up with nothing. Yeah, maybe having two hypotheses is the best one. You want to have multiple hypotheses in mind.

You also want to be aware that probably both of them are wrong, but you want to have enough engagement with the problem that you can generate experiment ideas. Maybe one way to phrase it is if you don’t have any idea what’s going on, it’s hard to notice what’s surprising. And often noticing what’s surprising is one of the most productive things you can do when doing research. (full context)

On The Benefits Of The Experimental Approach

I think there is a strong trend among people, especially the kind of people who get drawn to alignment from very theory based arguments, to go and just pure theory craft and play around with toy models and form beautiful, elegant hypotheses about what happens in real models. […] And there’s a kind of person who will write really detailed research proposals involving toy models that never have the step of like “go and make sure that this is actually what’s happening in the real language models we care about”. And I just think this is just a really crucial mistake that people often make. And real models are messy and ugly and cursed. So I vibe, but also you can’t just ignore the messy, complicated thing that’s the ultimate one we want to understand. And I think this is a mistake people often make.

The second thing is that, I don’t know, mechanistic interpretability seems hard and messy, but it seems kind of embarrassing how little we’ve tried. And it would just be so embarrassing if we make AGI and it kills everyone and we could have interpreted it, we just didn’t try hard enough and didn’t know enough to get to the point where we could look inside it and see the ‘press here to kill everyone’. (full context)

Intro

Michaël: I’m here today with Neel Nanda. Neel is a research engineer at Google DeepMind, most well known for his work on mechanistic interpretability, grokking, and his YouTube channel explaining what is going on inside of neural networks to a large audience. The YouTube channel is called Neel Nanda.

Neel: I’m very proud of that name.

Michaël: I remember meeting you four years ago at some party in London, and at the time you were studying mathematics in Cambridge, and you already had some kind of outreach where there were mathematics classes at Cambridge, the classes with Neel Nanda, that you were putting on YouTube while explaining additional stuff beyond the teachers. And now, fast forward to 2023, you’ve done some work with Anthropic, FHI, CHAI, Google DeepMind. You co-authored multiple papers around mechanistic interpretability and grokking, which we’ll talk about later in the episode. Thanks Neel for coming to the show. It’s a pleasure to have you.

Neel: Thanks for having me on.

Why Neel Started Doing Walkthroughs Of Papers On Youtube

Michaël: Let’s talk more about your YouTube experience. You have a YouTube channel where you talk about ML research, doing what you call walkthroughs of ML papers, and I think you are one of the only people doing this.

Neel: Yeah.

Michaël: Why did you start this? What was the story behind it?

Neel: The credit for this actually goes to Nick Cammarata, who’s an OpenAI interpretability researcher, who has this video called Let’s Play: Building Blocks of Interpretability, which is the name of the paper. He was like, “Let’s do a let’s play where I record myself reading through and playing with the interactive graphics.” And he was commenting on how this was just really good, and people seemed to really like it, and the effort was trivial compared to actually writing the paper.

Neel: And I was like, “Ah, seems fun.” And then one evening when I was in Constellation, this Bay Area co-working space, I was like, “Ah, it’d be kind of good to go do this for a mathematical framework [for transformer circuits]. I have so many hot takes. I think a mathematical framework [for transformer circuits] is a really good paper that most people don’t understand.” And decided to go sit down in a call booth and ramble.

Neel: This was incredibly low prep and low production quality, and involved me talking into a MacBook microphone and drawing diagrams on my laptop trackpad, just rambling in a call booth for three hours until 3am. And it was just really popular. Got retweeted by Dominic Cummings. It got, is it 5,000 views? And I’ve had multiple people tell me, “Yeah, I read that paper. Didn’t make any sense, but I listened to a walkthrough and it made loads of sense.” And people have explicitly recommended you should read the paper and then watch Neel’s walkthrough and then you will understand that paper.

Neel: And I was like, “Well, that was incredibly easy and fun. Guess I should do this more.” More recently I’ve started doing them more as interviews because, I don’t know, empirically, if I tell someone, “Let’s sit down and read through this paper together and chat,” then this will work and I will turn up at a time and do it. If I’m like, “I should, of my own volition, sit down and monologue through a paper,” it’s way higher effort and I’m way more likely to procrastinate on doing it. People still watch them. I don’t know why.

Michaël: Well, I watch them. I even listen to them while I’m at the gym. I have Lawrence Chan and Neel Nanda talking about grokking while I’m doing bench press.

Neel: But wait, there’s so many diagrams. This is the entire reason it’s not a podcast. It’s because we’re looking at diagrams and discussing.

Michaël: Well I guess some of it is you guys talking about numerical instability and why is it so hard to deal with very low loss, and why 3e-9 is important.

Neel: 1.19e-7, please.

Michaël: What’s the paper you talked about on your first marathon at 3am?

Neel: Oh, that was A Mathematical Framework for Transformer Circuits, which is still, in my opinion, the best paper I’ve been privileged to be part of. That’s this Anthropic paper we might discuss that’s basically a mathematical framework for how to think about transformers and how to break down the kinds of algorithms they can implement, and it just lays out a lot of the foundational concepts you kind of need to have in your head if you’re going to have any shot at doing mechanistic interpretability in a principled way. Maybe I should define mechanistic interpretability before I start referencing it.

An Introduction To Mechanistic Interpretability

What is Mechanistic Interpretability?

Michaël: Yeah, Neel Nanda, what is mechanistic interpretability?

Neel: Sure, so mechanistic interpretability is the study of reverse engineering the algorithms learned by a trained neural network. It’s kind of this weird flavor of AI interpretability that says, “Bold hypothesis. Despite the entire edifice of established wisdom in machine learning saying that these models are bullshit, inscrutable black boxes, I’m going to assume there is some actual structure here. But the structure is not there because the model wants to be interpretable or because it wants to be nice to me. The structure is there because the model learns an algorithm, and the algorithms that are most natural to express in the model’s structure and its particular architecture and stack of linear algebra are algorithms that make sense to humans.” And it’s the science of how can we rigorously reverse engineer the algorithms learned, figure out what the algorithms are and whether this underlying assumption that there’s structure makes any sense at all, and do this rigorously without tricking ourselves. Because as I’m sure will be a theme, it’s so, so easy to trick yourself.

Michaël: When you talk about deciphering the algorithms inside of neural networks, what is an example of some kind of algorithm we can see inside of the weights?

Modular Addition: A Case Study In Mechanistic Interpretability

Neel: Sure. So one example, which I’m sure we’re going to get to more later on, is this paper “Progress Measures for Grokking via Mechanistic Interpretability”, where I looked into how a one-layer transformer, a particular kind of neural network, does modular addition. And what I found is that it did modular addition by thinking of it as rotations around the unit circle, where if you compose two rotations, you’re adding the angles, which gets you addition. And because it’s a circle, this means it’s mod 360 degrees. You get modularity for free if you choose your angles at the right frequency. And I found that you could just go inside the model and see how the inputs were represented as trigonometry terms to parameterize the rotations, and how it used trigonometry identities to actually do the composition by multiplying together different activations.
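
To make the rotation trick concrete, here is a minimal numpy sketch of the algorithm Neel describes, written by hand rather than extracted from the trained network; the modulus matches the paper, but the choice of frequency and the readout are simplified for illustration.

```python
import numpy as np

# Toy illustration of the "rotations around the circle" trick: represent each
# residue as an angle, compose two inputs by adding angles via trig identities,
# and read the answer off by finding the best-matching angle.
p = 113          # modulus (the grokking paper used 113)
k = 7            # one of the handful of frequencies a trained model picks
w = 2 * np.pi * k / p

def embed(a):
    """Represent a residue as a point on the unit circle: (cos, sin)."""
    return np.array([np.cos(w * a), np.sin(w * a)])

def compose(a, b):
    """Add the two angles using the trig identities the model implements by
    multiplying activations: cos(x+y) and sin(x+y) from cos/sin of x and y."""
    ca, sa = embed(a)
    cb, sb = embed(b)
    return np.array([ca * cb - sa * sb, ca * sb + sa * cb])

def readout(vec):
    """The 'logit' for each candidate answer c is cos(w*(a+b-c)), which is
    maximized exactly when c = (a + b) mod p."""
    cs = np.arange(p)
    logits = vec[0] * np.cos(w * cs) + vec[1] * np.sin(w * cs)
    return int(np.argmax(logits))

a, b = 57, 98
assert readout(compose(a, b)) == (a + b) % p
```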

Michaël: Yeah, I think that’s one of the most salient examples of your work. And I think you posted it on Twitter, like, “Oh, I managed to find the composition of modular addition in cosines and sines.” Everyone lost their mind.

Neel: That was my first ever tweet, and that is the most popular thing I’ve ever tweeted. I just peaked at the start. It’s all been downhill since then.

Michaël: Well, I feel you’re still going upwards on YouTube, and you’ve done all these podcasts and everything. But yeah, I think that was very interesting. I’m curious if there are other examples of computation that we see inside of neural networks, or is that mostly the most well-known case?

Neel: It’s the one that I’m most confident actually is there, and in my opinion, it is the prettiest.

Induction Heads, Or Why Nanda Comes After Neel

Neel: Another example is that of induction heads, though this is going to get a bit more involved to explain. So a feature of language is that it often contains repeated subsequences. So models like GPT-3 are trained to predict the next word, and given a word “Neel,” if they want to predict what comes next, it is unfortunately not that likely that “Nanda” comes next. But if “Neel Nanda” has occurred five times in the text so far, “Nanda” is now a very good guess for what comes next, because “Oh, it’s a text about Neel Nanda.”

Michaël: It’s a podcast about Neel Nanda. It’s a transcript.

Neel: Exactly. “Hey, GPT-5.” This is actually a really, really common structure. You just could not know “Nanda” came next without searching the previous context and seeing that “Nanda” came after “Neel.” Models are just really good at this. They’re so good at it that they can actually predict, if you just give them completely randomly generated text, just randomly generated tokens, and then add some repetition, models are perfectly capable of dealing with that, which is kind of wild, because this is so far outside what they see in training. And it turns out they learn an algorithm that we call induction, notably implemented by these things we call induction heads.

Neel: In induction, essentially, there’s a head which learns to look from the token “Neel” to the token “Nanda,” that is, the token that came after an earlier occurrence of “Neel.” And it looks at “Nanda,” and then it predicts that whatever it’s looking at comes next. And this is a valid algorithm that will result in it predicting “Nanda.” And the reason this is a hard thing for a model to do is that the way transformers move information between positions is via this mechanism called attention, where each token gets three bits of information: a key which says “Here is the information I have to provide,” a query which says “Here’s the kind of information that I want,” and a value which says “Here is the actual information I will give you.” And the queries and keys are used to match things up, to find the token that is most relevant to the destination it wants to bring the information to.

Neel: And importantly, this is all symmetric: from the perspective of the query of token 17, it looks at the key of token 16, the key of token 15, until the key of token 1, all kind of separately. There’s no relationship between the key of token 15 and the key of token 16. It can’t tell that they’re next to each other. It just shuffles everything into an enormous mess and then hunts for the keys that most matter. Because the key for token 16 has no relationship to the key for token 15, it’s all kind of shuffled up from the model’s perspective. It’s really hard to have a key that says “The token before me was Neel.” And the model needs to actually do some processing to first move the information that the previous token was Neel along by one, and then compute a more complicated key that says “The thing before me was Neel.” So the attention head knows how to properly identify Nanda.
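
As a rough sketch of what the induction circuit achieves at the token level (the behavior, not the attention heads themselves): find the most recent earlier occurrence of the current token and predict whatever followed it. A minimal Python version, with a made-up token list:

```python
def induction_predict(tokens):
    """Predict the next token the way induction heads effectively do:
    find the most recent earlier occurrence of the current token and
    guess that whatever followed it will follow again.
    Returns None if the current token has not appeared before."""
    current = tokens[-1]
    # Scan backwards through the earlier context (excluding the last position).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]   # the token that followed the earlier occurrence
    return None

# "... Neel Nanda ... Neel" -> predict "Nanda"
context = ["my", "name", "is", "Neel", "Nanda", "and", "I", "am", "Neel"]
print(induction_predict(context))  # Nanda
```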

Detecting Induction Heads In Basically Every Model

Neel: The other interesting thing about induction heads is these are just a really big deal in models. They occur in basically every model we’ve looked at, up to 70 billion parameters. But we found them by looking at a two-layer attention only model, which is just one of the best vindications thus far that studying tiny toy models can teach us real things. It’s a very cool result.

Michaël: I guess induction heads was this paper by Anthropic, maybe 2021, 2022? And in the paper, they might have studied smaller models, and so you’re saying they checked as well for 70 billion parameter models, or is this later evidence?

Neel: What actually happened is, we published two papers around late 2021, early 2022, and we found the induction heads in the first one, the mathematical framework, in the context of two-layer attention-only models. But we were in parallel writing a sequel paper, on in-context learning and induction heads, where we looked at up to 13 billion parameter models. 70 billion is just, I looked in Chinchilla and it has them. May as well just increase the number I’m allowed to cite during talks.

Michaël: The 13 billion is the actual number in the paper, but to go higher you need to actually talk to Neel Nanda and see what he’s doing on the weekends.

Neel: Yes, I just really need to convince someone at OpenAI to go look in GPT-4 so I can be “the biggest model in existence has it guys, it’s all great”.

Michaël: When you say Chinchilla, I think it’s a paper by DeepMind, right? So you have access to it, but it’s not public, right?

Neel: It’s not open source, no.

Michaël: I think that’s an interesting thing, that you do research on the side. You don’t just do research during the day, but you also do a bunch of mentoring and a bunch of weekend marathons where you try to explore things. It’s so fun!

Neel: And I’m so much better at procrastinating than doing my actual job.

Michaël: It’s great. Yeah.

Neel’s Backstory

How Neel Got Into Mechanistic Interpretability

Michaël: I’m curious, how did you start getting so much in love with mechanistic interpretability, which we’ll maybe just call mech interp moving forward? Because four years ago you were maybe doing alignment work at different orgs, but maybe not that much interested. What was the thing that made your brain go “oh, this is interesting”?

Neel: Yeah. So, I don’t know, I kind of just feel we have these inscrutable black boxes that can do incredible things that are becoming increasingly important in the world. We have no idea how they work. And then there is this one tiny subfield led by this one dude, Chris Olah, that feels like it’s actually getting some real insights into how these things work. And basically no one is working in this or taking this seriously. And I can just go in and in a year be one of the top researchers in mechanistic interpretability in the world. It’s just “What? This is so fun! This is incredible! Why isn’t everyone doing this?” You could just look inside them and there are answers. It’s also incredibly, incredibly cursed and messy and horrible, but there is real structure here. It’s so pretty.

Michaël: There’s a beautiful problem and there’s five people working on it and everyone is super smart and doing a bunch of crazy things. And there’s only five people so you can just join them and look at this thing by yourself.

Neel: Yes. We’re now at 30 to 50 though. So your time is running out. If you’re hearing this and you want to get in on the ground floor, there’s not that much time left. We’ll definitely have solved this thing next year. It’ll be easy.

Neel’s Journey Into Alignment

Michaël: Would you say you became interested because of its links to alignment and you wanted to solve alignment somehow? When did you get interested in alignment? Was it before that?

Neel: Yeah. I think so. So maybe I want to distinguish this into two separate claims. There’s when did I decide I was excited about working on alignment? And then there’s when did I decide I wanted to work on alignment? Where I feel I decided that I wanted to work on alignment a lot earlier than I actually became excited about working on alignment. Where I’ve been involved in EA for a while. I read Harry Potter and the Methods of Rationality when I was 14. I hung out on LessWrong a bunch. I read a bunch of the early AI safety arguments and they just kind of made sense to me. And I spent a lot of time hanging out with EAs; to a lot of my friends, this stuff mattered.

Neel: And honestly, I spent quite a long time working, hanging out in this space before I internalized that I personally could probably go and do something useful rather than alignment being this weird abstract thing that might matter in 100 years but the only thing people did today was prove random useless theorems about. So I managed to figure that one out towards the end of my degree. I graduated in about 2020 from undergrad at Cambridge in maths for context. I gradually realized, wait, shit, something in alignment probably matters. This seems a really important problem. Maybe I should go try to figure this out.

Neel: And then I was actually going to go work in finance and then at the last minute was like, hmm, I don’t really want to go work in alignment but I don’t have a good reason for this. I just kind of have this like, ugh, what even is alignment, man? This seems kind of messy. And this seems a bad reason. And also I have no idea what working in alignment even means. I haven’t actually checked. And maybe I should go check. And this in hindsight was a much easier decision than I thought it was. So I then took a year and did a bunch of back-to-back internships at some different alignment orgs: the Future of Humanity Institute doing some mathsy theory stuff, Google DeepMind, or back then just DeepMind, doing some fairness and robustness work, and the Center for Human Compatible AI doing some interpretability work. And all of these were a bit of a mess for a variety of different reasons. And nothing I did really clicked. But I also just spent a lot of time hanging out around alignment people, started to become a lot more convinced that something here mattered and I could go and actually do something here that was useful.

Neel: And I then lucked out and got an offer to go work with Chris Olah at Anthropic. And at the time, I think I massively underweighted what an amazing opportunity this was, both because I kind of underweighted, like, holy shit, Chris Olah is a genius who founded a groundbreaking research field and will personally mentor you. This is such a good opportunity. And also, I think I was underweighting just how important getting excited about a thing was and how it just seemed… I don’t know. I had some concerns that mechanistic interpretability would be too narrow and detail-oriented, tedious for me to get excited about, which I think were reasonable concerns, and I’m just not that excited about the detail-oriented parts. But fortunately, there’s enough of them that it’s fine. But I eventually decided to accept the offer. My reasoning wasn’t great, but I made the correct decision, so who cares?

Neel: And yeah, I don’t really know if there was a point where I fell in love. I think there were some points early on where I felt I had some real insights. I came up with the terms Q, K, and V composition as part of helping to write the mathematical framework paper, and I actually got some positive feedback from Chris that I’d made a real research contribution and started to feel less insecure and more like, “Oh wow, I can actually contribute.” Though I think it only really became properly clear to me that I wanted to pursue this long-term after I left Anthropic and had some research success doing this work on my own.

Neel: I did, notably, this Progress Measures for Grokking via Mechanistic Interpretability paper and just had a week where I was incredibly nerd-sniped by understanding what was up with modular addition, had this conviction that obviously the grokking paper was a great place to apply mechanistic interpretability that just no one was trying, and then was vindicated when I was indeed correct and got some research results that everyone else agreed were cool that no one had done, and was just like, “Wow, I actually properly led a research thing. I can’t be insecure about this, and I can’t just be like, ‘Ah, really this was someone else’s thing and I just helped a bit.’ This is my research thing that I owned.” That I think was cool. I’m insecure about how cool it was, but that was probably the moment where I was most clearly like, “I want to do this.”

Enjoying Mechanistic Interpretability And Being Good At It Are The Main Multipliers

Michaël: I think there’s a story about whether alignment is important at all. Is it a real thing that I can make progress on? Are people actually doing research on this productively? Is this a real problem to solve? Is it urgent? Then there’s, “Can I do anything about it? Is there anything I can do that I feel excited about?” The one-year internships are more like, “Oh, is there something going on?” You might not be sure that you can do research, but the moment where you realize that you can do research was the Chris Olah contribution where you’re like, “Oh, I can do some stuff.” Then the modular addition work is like, “Oh, I’m actually good at this. I’m pretty good at this. Maybe I have a superpower in this. Maybe I should probably do this full-time or something.”

Neel: Yeah, pretty much. One thing I think is a bit overrated is the thing I was initially trying to do of finding the most important thing and going to work on that, where A, I think this is kind of doomed because it’s just really complicated and confusing. And B, I just feel the fact that I like mechanistic interpretability and I’m good at it is just such a ridiculous multiplier of my productivity that I just can’t really imagine doing anything else, even if I became convinced that another angle like scalable oversight was twice as impactful as mechanistic interpretability.

Michaël: So you’re saying that basically you just enjoy doing it a lot and it’s good that it’s impactful, but most of your weight is on what is making you productive and excited.

Neel: Yeah. And I think people should just generally weight finding the thing they’re excited about more than I think many people do, because many people are EAs and thus overly self-flagellating.

Michaël: Yeah, if someone watching this doesn’t know what EA means, it’s “effective altruism.” Because otherwise you’re going to be lost.

Neel: Do you have audience members who don’t know what EA is?

Michaël: There’s thousands of people on YouTube that like, there’s probably at least 10% or 20% that don’t. On my video with Connor, a lot of comments were like, “What the hell is the EA thing?”

Neel: Hello, today is lucky 10,000.

Michaël: I don’t know. There’s 10,000 people in the world that maybe are part of the effective altruism movement. So I would be very surprised if everyone watching my videos was part of it.

Twitter Questions

How Is AI Alignment Work At DeepMind?

Michaël: I also asked people on Twitter to ask you some questions. Dominic was curious about your career path, how you went into DeepMind from being an independent researcher, and what DeepMind alignment work is like. But I guess you already answered the first part. So yeah, what is DeepMind alignment work like?

Neel: Pretty fun. I’m not sure it’s actually that different from just any other kind of alignment work, which makes me not sure how to answer that question. I think in particular for mechanistic interpretability, I’m personally pretty excited about doing most of our research on open source models and generally trying to make it so that we can be as scientifically open as possible. Obviously, one of the main benefits of doing alignment work in an industry lab is you get access to proprietary models and you get access to proprietary levels of compute.

Scalable Oversight

Neel: Honestly for mechanistic interpretability, I think both of these advantages are significantly less important than for say the scalable oversight team where you just can’t do it if you don’t have cutting edge models.

Michaël: Can you just quickly define what scalable oversight is, for people who don’t know?

Neel: Scalable oversight is this idea, you can kind of think of it as RLHF++. No, people don’t know what RLHF means. The way we currently train these frontier language models like ChatGPT is with a technique called reinforcement learning from human feedback, where the system does something and then a human rater gives it a thumbs up or a thumbs down depending on whether it was good, and you use this technique called reinforcement learning to tell it to do more of the stuff that gets a thumbs up and less of the stuff that gets a thumbs down. And today this works kind of fine, but this just pretty obviously has lots of conceptual issues, because humans are dumb and humans aren’t experts in everything and there’s often subtle problems in models. And if you just give it a thumbs up or a thumbs down on a couple of seconds of inspection, then you can easily reward things that are superficially good but not actually good, and things like that. And this is all just, yeah, kind of an issue.

Neel: What ends up happening is that this is just probably not going to scale. And scalable oversight is “what are forms of giving feedback to models that might actually scale to things that are smarter and better”. And it covers things like: rather than judging the output of the model, you could have two models discuss something and a human rates the one they think made the best argument, which is an idea called AI safety via debate. You might have AIs help humans give feedback by critiquing the output of another AI. That’s the kind of thing that happens in scalable oversight. I kind of think about it as coming up with the kinds of schemes where, as the AI gets better, our ability to give it oversight gets better. And accordingly, most of the ideas revolve around getting the AI to help you give feedback to the AI.
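
As a very loose caricature of the RLHF loop described above (not anyone’s actual training code), here is a toy bandit where a simulated rater gives a thumbs up or down and a REINFORCE-style update pushes the policy toward thumbs-up outputs; the canned responses, the rater, and its chance of falling for confident nonsense are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "policy" chooses between three canned responses; a simulated human
# rater returns +1 (thumbs up) or -1 (thumbs down).
responses = ["helpful answer", "confident-sounding nonsense", "refusal"]
logits = np.zeros(3)

def human_rating(i):
    # The rater likes the helpful answer, but falls for the confident
    # nonsense 30% of the time: the failure mode scalable oversight targets.
    if i == 0:
        return 1.0
    if i == 1:
        return 1.0 if rng.random() < 0.3 else -1.0
    return -1.0

for _ in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()
    i = rng.choice(3, p=probs)
    reward = human_rating(i)
    # REINFORCE update: nudge the log-probability of the chosen response
    # up when rewarded, down when penalized.
    grad = -probs
    grad[i] += 1.0
    logits += 0.1 * reward * grad

final_probs = np.exp(logits) / np.exp(logits).sum()
print({r: round(float(p), 2) for r, p in zip(responses, final_probs)})
```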

Most Ambitious Degree Of Interpretability With Current Transformer Architectures

Michaël: This other question is from Siméon Campos, who was on the podcast before. He’s asking: what is the most ambitious degree of interpretability that you expect to get with current transformer architectures?

Neel: Not entirely sure how to answer the question. Is the spirit how far can interpretability go?

Michaël: Yeah. How far can we actually go?

Neel: My guess is that we can go pretty far. My guess is that we could in theory take anything GPT-4 does and answer most reasonable questions we could care about around like, why did it do this? Is it capable of this behavior? I think we’re much more bottlenecked by sucking at interpretability than by the models being inherently uninterpretable. And it’s kind of a fuzzy question, because maybe the model really is cursed in some way, and you can imagine a different model architecture that makes it less cursed, but we could deal with the cursedness if we were just smart enough and tried hard enough.

Michaël: So that’s the part about “are humans capable of doing this with current coordination or our brains?” And then there’s like, is it actually possible? On paper?

Neel: Yes. Oh yeah. And then there’s the other question of the fear of alien abstractions, that models are implementing algorithms that are just too complicated or abstractions we haven’t thought of. And my guess is we’re just very far off this being an issue, and even up to human-level systems, this is probably not going to be a dramatically big deal, just because so much of the stuff the models are doing is just not conceptually that hard. But I’m sure we’re going to eventually have models that figure out the equivalent of 22nd-century quantum mechanics, where I expect us to be kind of screwed.

Research Methodology And Philosophy For Mechanistic Interpretability

To Understand Neel’s Methodology, Watch The Research Walkthroughs

Michaël: Yeah. If we need to have another quantum mechanics breakthrough before we understand neural networks, it’s maybe a tough bet. I guess to answer the question of how ambitious can we be. I think we can just go to the angle of like, how do you actually look at weights and how do you actually do this work? Because I think it’s kind of an open question of like, how does Neel Nanda stare at weights and come up with new computation or new theories?

Neel: Sure. So for people who are curious about this, I do in fact have seven hours of research walkthroughs on my channel where I record myself doing research and staring at model weights and trying to understand them. And you can just go watch those. I also have another 16 hours of those unreleased that I should really get around to putting out sometime. Because it turns out a great productivity hack is just announcing to people, “I’m going to hang out in a Zoom call and do research for the next several hours. I’ll record it, come watch.”

Michaël: So a short, long answer, just watch 16 hours of YouTube video.

Neel: Exactly. Like, why does anyone need any other kind of answer?

Three Modes Of Research: Confirming, Red Teaming And Gaining Surface Area

Neel: Trying to engage with the question, I kind of feel a lot of my research style is dominated by this deep seated conviction that models are comprehensible and that everything is fundamentally kind of obvious and that I should be able to just go inside the model and there should be this internal structure. And so one mode of research is I just have all of these hypotheses and guesses about what’s going on. I generate experiment ideas for things that should be true if my hypothesis is true. And I just repeatedly try to confirm it. Another mode of research is trying to red team and break things, where I have this hypothesis, I do this experiment, I’m like, “oh my God, this is going so well”.

Neel: I then get kind of stressed because I’m concerned that I’m having wishful thinking and I try to break it and falsify it and come up with experiments that would show that actually life is complicated. A third mode of research is what I call trying to gain surface area, where I just have a system that I’m pretty confused about. I just don’t really know where to get started. Often, I’ll just go and do things that I think will get me more information. Just go and plot stuff or follow random things I’m curious about in a fairly undirected fuzzy way. This mode of research has actually been the most productive for me. Or at least, when I think about what feel like my biggest research insights, it feels like they’ve been downstream of this kind of exploratory fuck-around-and-find-out mode.

Michaël: So the first mode is you have a hypothesis and you want to verify it. The second is you think you’re wrong and you try to find counter examples for why you’re wrong.

Neel: No, no, I think I’m right, but I’m insecure about it. So I go and try to prove that I’m wrong instead.

Michaël: That’s something people often do when they’re trying to increase their confidence in something. They try to find counter examples, to find the best counter arguments. And the third one is just explore and gain more information and plot new things.

Neel: You could paraphrase them as, “Isn’t it really obvious what’s going on?”, “Oh man, am I so sure about this?” and “Fuck around and find out”.

Michaël: Fuck around and find out.

You Can Be Both Hypothesis Driven And Capable Of Being Surprised

Michaël: Is there anything that you think people don’t really understand about your method that is under appreciated or surprising? And if people were to watch 20 hours of you doing things, they would be like, “oh, he actually spends that amount of time doing X”.

Neel: I think people underestimate how much this stuff can be hypothesis driven and how useful it is to have enough of an exposure to the problem and enough of an exposure to the literature of what you find inside models that you can form hypotheses. Because I think that this is often just really useful.

Michaël: I want to push back on this, because of the Neel Nanda from other podcasts. I’ve listened to your podcast with Tim Scarfe on ML Street Talk, and you say kind of the opposite, where you say, oh, you need to be willing to be surprised, you shouldn’t hold onto a particular hypothesis so much, and you say multiple times that you need to be willing to be surprised. So I’m kind of feeling that the Neel Nanda from a few months ago would disagree here.

Neel: So I think these are two simultaneously true statements. It’s incredibly important that you have the capacity to be surprised by what you find in models. And it is often useful to go in with a hypothesis. I think the reason it’s useful to have a hypothesis is that it’s just often really hard to get started. And it’s often really useful to have some grounding that pushes you in a more productive direction and helps you get traction and momentum. And then the second half is it’s really important to then stop and be like, “Wait a minute, I’m really fucking confused.” Or “Wait, I thought I was doing this, but actually I got the following disconfirming evidence.”

You Need To Be Able To Generate Multiple Hypotheses Before Getting Started

Neel: You can kind of think of it as “strong beliefs weakly held”. I think being good enough that you can start to form hypotheses, being at the point where you can sit down, set a five minute timer, brainstorm what’s going on and come up with four different hypotheses, is just a much, much stronger research position than when you sit down and try to brainstorm and you come up with nothing. Yeah, maybe having two hypotheses is the best one. You want to have multiple hypotheses in mind. You also want to be aware that probably both of them are wrong, but you want to have enough engagement with the problem that you can generate experiment ideas. Maybe one way to phrase it is if you don’t have any idea what’s going on, it’s hard to notice what’s surprising. And often noticing what’s surprising is one of the most productive things you can do when doing research.

All the theory is bullshit without empirical evidence and it’s overall dignified to make the mechanistic interpretability bet

Michaël: This take about being willing to be surprised is from ML Street Talk. It’s a four hour podcast. I highly recommend watching it. And I think there’s a few claims that you make in there that I think are interesting. I don’t want to go all in because I think people should listen to the ML Street Talk podcast, but I will just prompt you with what I think is my summary of the takes and you can give me the Neel Nanda completion of the prompts.

Neel: Sure, that sounds fun. I love being a language model.

Michaël: It’s good practice. All the theory is bullshit without empirical evidence and it’s overall dignified to make the mechanistic interpretability bet.

Neel: I consider those two different claims.

Michaël: Make two outputs.

Neel: Yes. So, I don’t know. I think there is a strong trend among people, especially the kind of people who get drawn to alignment from very theory based arguments, to go and just pure theory craft and play around with toy models and form beautiful, elegant hypotheses about what happens in real models. That turn out to be complete bullshit. And there’s a kind of person who will write really detailed research proposals involving toy models that never have the step of like, “and then go and make sure that this is actually what’s happening in the real language models we care about”. And I just think this is just a really crucial mistake that people often make. And real models are messy and ugly and cursed. So I vibe, but also you can’t just ignore the messy, complicated thing that’s the ultimate one we want to understand. And I think this is a mistake people often make. The second thing is that, I don’t know, mechanistic interpretability seems hard and messy, but it seems kind of embarrassing how little we’ve tried. And it would just be so embarrassing if we make AGI and it kills everyone and we could have interpreted it, we just didn’t try hard enough and didn’t know enough to get to the point where we could look inside it and see the “press here to kill everyone”.

Mechanistic interpretability is alien neuroscience for truth seeking biologists in a world of math

Michaël: Second prompt. Mechanistic interpretability is alien neuroscience for truth seeking biologists in a world of math.

Neel: I like that take. I don’t have anything better to say on that take. Well phrased.

Michaël: [silence]

Neel: Okay, fine. I have stuff to say.

Neel: The way I think about it is: models have lots of structure. There’s all kinds of underlying principles that determine what algorithms are natural to express if you’re a language model. And we just don’t really know how these work. And there’s lots of natural human intuitions for how this stuff works, where we think it should look like this, and we think it should look like that. I did not expect that the way modular addition was implemented in a model was with Fourier transforms and trigonometry identities, but it turns out that it is. And this is why I think it’s really crucial that you can be surprised. Because if you go into this not having the ability to notice, “Wait, this is just a completely different ontology to what I thought,” you’ll just conclude everything is cursed, give up and go home.

Michaël: There’s something about the world of math part, where everything the language models are doing is made from matrix multiplications and sometimes non-linearities, but it’s mostly well understood. So in biology, we have, let’s say, a map of the territory and we’re just thinking about cells and atoms and everything. But here we have this very rigid structure that is giving birth to these alien neurons; it’s human math giving birth to alien neuroscience.

Actually, Othello-GPT Has A Linear Emergent World Representation

Neel: Yep. Yeah, another good example here is this work I was involved in based on this Othello paper, where the headline result of the original paper was that they trained a model to predict the next move in this board game Othello and found that the model learned to simulate the state of the board. You gave it these chess-notation-style moves, like “black plays to cell C7”, and then you could look inside the model on that token and see that it knew the state of the whole board.

Neel: It knew that this move had just taken the following pieces and stuff like that. And this was a really popular, exciting paper. It was an oral at ICLR. People were really excited because it seemed to show that language models trained to predict the next token could learn real models of the world and not just surface level statistics. But the plot twist of the paper that I found when I did some follow-up work was, so they’d found this weird result that linear probes didn’t work for understanding what was happening inside the model. A linear probe is when you just look for a direction inside the model corresponding to, say, “this cell is black” or “this cell is white”, and they’d had to train nonlinear probes. And this is weird because the way we normally think models think is that they represent things internally as directions in space.

Neel: If the model has computed the state of the board it should be recoverable with a linear probe. There should just be a direction saying this cell is black or something. And what I found is that the model does think in terms of directions but that it doesn’t care about black or white. It cares about “This has the same color as the current player or this has a different color from the current player.” Because the model was trained to play both black and white moves, the game is symmetric and thus this is just a more useful structure for it.

Neel: And this is just another cute example of alien neuroscience. From my perspective, the way I would compute the board is each move I would recursively update this running state. If you’re doing that, obviously you think in terms of black or white. Each player moves and it updates the last thing a bit. But this is just not actually how transformers work. Because transformers can’t do recurrence. They have to compute the entire board in parallel and the model is playing both black and white. So from its perspective, doing the current player’s color relative to that is way more important and way more natural.
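
A tiny sketch of the ontology shift Neel found, using a made-up board encoding rather than the actual Othello-GPT code: relabel each cell relative to the player to move instead of as black or white, which is the representation the linear probe picks up.

```python
# Sketch of the re-labeling that made linear probes work in the follow-up:
# instead of probing for black vs. white, describe every cell relative to
# whose turn it is. The 0/1/2 encoding here is invented for illustration.
EMPTY, BLACK, WHITE = 0, 1, 2

def relabel_board(board, player_to_move):
    """Map absolute colors to {empty, mine, theirs} from the perspective
    of the player about to move."""
    relabeled = []
    for cell in board:
        if cell == EMPTY:
            relabeled.append("empty")
        elif cell == player_to_move:
            relabeled.append("mine")
        else:
            relabeled.append("theirs")
    return relabeled

board = [EMPTY, BLACK, WHITE, BLACK]
print(relabel_board(board, BLACK))  # ['empty', 'mine', 'theirs', 'mine']
print(relabel_board(board, WHITE))  # ['empty', 'theirs', 'mine', 'theirs']
```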

You Need To Use Simple Probes That Don’t Do Any Computation To Prove The Model Actually Knows Something

Michaël: Just to go back to the linear probe thing, for people who don’t know, it’s training a classifier on the activations of the network and you’re trying to see if you can have a perfect classifier on the activations and if you have this then you’re pretty sure you found something, right?

Neel: Yes. So probing is this slightly conceptually cursed field of study in interpretability that’s trying to answer questions about what a model knows. And the classic thing people are doing is they’re trying to look for linguistic features of interest inside the model. Does it know that this is a verb or a noun or an adjective? And the thing you can do is you can take an activation inside the model, the residual stream after layer 17 or something, and you can just do a logistic regression or train a linear classifier or whatever thing you want to see if you can extract the information about a noun, verb, or adjective. And if you can, the standard conclusion is yes, the model has computed this.

Neel: The obvious problem is that we’re just kind of sticking something on top of the model. We’re just kind of inserting a probe in the middle and we have no guarantee that what the probe finds is actually used by the model. It’s a purely correlational technique. And you can imagine if you take a really dumb language model and then your probe is GPT-3, I’m sure GPT-3 can figure out whether something’s an adjective, noun, or a verb. And thus your probe could just learn it itself. You have no real guarantee this is what the model is doing. And so one of the core challenges is to have a probe simple enough that it can’t be doing computation on its own, so it has to be telling you what the underlying model has learned. And this is just kind of a hard problem.
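
To illustrate the kind of simple probe being described, here is a minimal sketch using logistic regression on synthetic “activations” with a feature direction planted by hand; in real probing the activations would be cached from a model at some layer, and all names and sizes here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Fake "residual stream" activations with a binary feature (say, "is this
# token a verb?") written along a single planted direction, so that a probe
# with no nonlinearity can recover it.
d_model, n_samples = 256, 2000
feature_direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n_samples)
activations = rng.normal(size=(n_samples, d_model))
activations += np.outer(labels, feature_direction)   # add the feature direction when label == 1

probe = LogisticRegression(max_iter=1000)             # purely linear: no hidden layers
probe.fit(activations[:1500], labels[:1500])
print("probe accuracy:", probe.score(activations[1500:], labels[1500:]))
```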

Michaël: So instead of having a bunch of non-linearities and a bunch of layers, you just have the simplest possible model with no nonlinearity, just a very simple classifier.

Neel: Yeah.

The Mechanistic Interpretability Researcher Mindset

Michaël: Another claim is what are the four main things you need to do to be in the right mindset to be a mechanistic interpretability researcher?

Neel: Yeah. So I think there’s a couple of things I said when I was on ML Street Talk. I don’t know if I remember any of them, so let’s see if I can regenerate them. So I think that it’s really important to be ambitious, to actually believe that it’s possible to genuinely understand the algorithms learned by the model, that there is structure here, and that the structure can be understood if we try hard enough. I think that it’s really important to just believe it’s possible. And I think that much of the field of interpretability kind of fails because it’s done by ML people. You have this culture that you can’t aim for understanding, that understanding isn’t possible, that you need to just have lots of summary statistics and benchmarks, and that there isn’t some underlying ground truth that we could access if we tried hard enough.

Michaël: So being ambitious is actually possible. You can be ambitious and actually understand what’s going on.

Neel: Yeah. I think in some sense, this is one edge I have as someone who just doesn’t have a machine learning background. I think there’s a bunch of ways that the standard cultural things that have really helped with success in ML, this focus on benchmarks, this focus on empiricism… I love empiricism. But this focus on, like, make number go up and achieve SOTA on benchmarks is just fundamentally the wrong mindset for doing good interpretability work.

Neel: Another point is being willing to favor depth over breadth. Models are complicated. A thing that often makes people bounce off mechanistic interpretability is they hear about it and they’re like, “Oh, but how do you know that this thing, this algorithm you found in one model generalizes to another model?” And I’m like, “I don’t. That’s the entire point.” There is a real ground truth to what different models have learned. And it’s possible that what one model has learned is not what another model has learned.

The Algorithms Learned By Models Might Or Might Not Be Universal

Neel: My bet is that in general, these algorithms are fairly universal, but like, maybe not. No one has checked it all that hard. And this is just like, clearly a really important thing that you just want to be able to take a model and find the truth of what that model has learned. And the steelman of the standard critique is that people think it’s just boring if every model has a different answer, and like, “Ah, it’s kind of a taste thing.” My guess is that in general, models have the same answer, but it’s that like, I am willing to take a specific model and go really deep into trying to understand how it works.

Michaël: I think that the level at which you say they’re kind of similar is more like in biology, where a bunch of mammals have hands and feet or something, but we don’t have the same hands. And so you expect the structure or circuits inside of neural networks like transformers to have this kind of similar structure, but maybe vary in shape or color and those kinds of things.

Neel: Yeah. I think I expect them to be more similar than say the hands of mammals are, though I do expect things to get… it depends how you change it. If you just change the random seed, my guess is most things are going to be pretty consistent. But with some randomness, especially for the kind of circuits the model doesn’t care about that much, which we might get to later with Bilal’s work on a toy model of universality. And then there’s like, if you make the model a hundred X bigger or give it a hundred X more data, how does that change what it learns? And for that, I’m like, well, I don’t really know.

Neel: Some things will be consistent. Some things will change. A final principle of doing good mechanistic interpretability work is… I think it’s really important… Actually no, two final principles.

On The Importance Of Being Truth Seeking And Skeptical

Neel: I think it’s really important to be truth seeking and skeptical, to really keep in mind models are complicated. It’s really easy to trick myself. I need to try really hard to make sure that I am correct: that I’ve entertained alternate hypotheses, I’ve tried to break my hypothesis, I’ve run the right baselines. For example, a common mistake is people come up with some number and they’re like, “I think that number is big.” And they don’t have a baseline of like, “Oh, what if I randomly rotated this or shuffled these or randomly guessed?” And it turns out when you do that, some of the time the number is boring, because they just didn’t really know what they were doing. And the final principle that I think is incredibly important is to have a real intuition for models: to have read papers like A Mathematical Framework for Transformer Circuits and stared at them.

Neel: Be able to sit down and map out on paper the kinds of algorithms that a transformer might learn. Be able to sit down carefully and try to think through what’s going on and be able to tell this experimental method is principled, or this experiment makes no sense because I’m training a probe in a way that makes this basis special, but this basis is not privileged. Or be able to tell like, wait, there’s no way this would be possible because of the causal mask of the transformer. I don’t know. I’m failing to come up with good examples off the cuff, but there’s just all kinds of things that make some methods just laughably nonsense if you know what you’re doing. And I think that often people will write papers who don’t have these intuitions and just kind of do garbage. If I can plug, there’s this great new set of tutorials from Callum McDougall at ARENA for mechanistic interpretability, and I think that just going through all of those and doing all of the exercises and coding things up gets you a fair bit of the way to developing these intuitions. I also have a guide at neelnanda.io/getting-started on how to get started in the field. I think either of these will put you in pretty good stead.

The Linear Representation Hypothesis: Linear Representations Are The Right Abstractions

Michaël: One last claim I think is kind of interesting is that linear representations are somehow the right abstractions inside of neural networks.

Neel: Yeah, so okay, so there’s a bunch of jargon to unpack in there. So the way I generally think about neural networks is that they are feature extractors. They take some input, like “the Eiffel Tower is in Paris”, and detect a bunch of properties: “this is the word ‘the’ and it is at the start of the sentence”, “this is the tower token in Eiffel Tower”, “this is a European landmark”, “this is in the city of Paris”, “I am doing factual recall”, “the sentence is in English”, “this is a preposition”, “the thing that should come next is a city”, a bunch of stuff like that.

Neel: And a lot of what models are doing is doing this computation and producing these features and storing them internally somehow. And so a really important question you need to ask yourself if you want to interpret the model is how are these represented? Because internally models are just a sequence of vectors. They have a bunch of layers that take vectors and produce more vectors by multiplying them with matrices and applying various kinds of creative non-linearities. And so a thing you need to ask yourself is “how does this model work? How do these vectors contain these features?”

Neel: The hypothesis that I think is most plausible is this idea called the linear representation hypothesis, which says that there’s kind of a meaningful coordinate basis for your space, a meaningful set of directions, such that the coordinate in this direction is 1 if it’s the Eiffel Tower and 0 otherwise, the coordinate in that direction is 1 if it’s in English and 0 otherwise, and so on. And by looking for each of these different directions, the model is capable of implementing a bunch of complex computation. And one of the main reasons you might think this is intuitive is that the models are made of linear algebra. And if you’re feeding something into a neuron, basically the only thing you can do is project onto different directions and add them up.

Neel: A thing that is sometimes true is that there are individual meaningful neurons in the model, but a neuron is just a basis element. And so if a neuron is meaningful, like it fires when there’s a cat and it doesn’t fire otherwise, then the basis direction for that neuron is a meaningful direction that means cat. And one of the main complications for this is this weird-ass phenomenon called superposition that I think we’re going to get to at some point, or possibly should segue into now, who knows.
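
A small numeric sketch of the linear representation hypothesis as stated here, with invented feature names and random directions: the activation vector is a sum of directions for the active features, and each feature is read back with a single dot product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented feature names and random feature directions; real directions would
# be learned by the model, not sampled like this.
d_model = 512
features = ["is_Eiffel_Tower", "is_in_English", "is_preposition", "next_token_is_city"]
directions = {f: rng.normal(size=d_model) / np.sqrt(d_model) for f in features}

# Build an activation for a token where three of the four features are active.
active = {"is_Eiffel_Tower": 1.0, "is_in_English": 1.0, "next_token_is_city": 1.0}
activation = sum(value * directions[name] for name, value in active.items())

# Reading out each feature is just a projection onto its direction.
for name in features:
    d = directions[name]
    score = activation @ d / (d @ d)
    print(f"{name:22s} {score:+.2f}")   # roughly 1 for active features, roughly 0 otherwise
```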

Michaël: Yes, let’s move on to superposition.

Superposition

Superposition Is How Models Compress Information

Neel: Yeah, so I think the example I was giving earlier of the sentence “the Eiffel Tower is in Paris” is probably a good example. So we know that the model knows the Eiffel Tower is in Paris. It’s somehow able to look up Eiffel Tower and get this information on Paris. But Eiffel Tower is a pretty niche feature. Eiffel Tower is not incredibly niche, but models know all kinds of extremely niche things. Who Eliezer Yudkowsky is, for example, is solidly worth knowing, but also kind of weird and niche. And 99.99% of the time, this is not going to come up. And so it’s kind of weird for a model to need to dedicate a neuron to Eliezer Yudkowsky if it wants to know anything about him, because this is just going to lie there useless most of the time. And empirically, it seems models just know more facts than they have neurons.

Neel: And what we think is going on is that models have learned to use compression schemes. Rather than having a dedicated neuron to represent Eliezer Yudkowsky, you could have 50 neurons that all activate for Eliezer Yudkowsky and all boost some Eliezer Yudkowsky direction a little bit. And then each of these 50 neurons also activates for 100 other people, but each activates for a different set of other people. So even though each neuron will now boost 100 different people vectors whenever it activates, the 50 Eliezer neurons will all activate on Eliezer and all constructively interfere on the Eliezer direction, while destructively interfering on everything else. And superposition is broadly this hypothesis that models can use compression schemes to represent more features than they have dimensions, exploiting sparsity, this fact that Eliezer Yudkowsky just doesn’t come up that often. So it doesn’t matter that each of these neurons is representing 100 different things, because the 100 things they represent are never going to occur at the same time. So you can get away with the neuron doing a bunch of different things at once. And this is a really big deal, because… So a thing which seems to be kind of true in image models is that neurons were broadly meaningful.

The Polysemanticity Problem: Neurons Are Not Meaningful

Neel: Like, there would be a neuron that meant a car wheel, or a car body, or a cat, or golden retriever fur, and things like that. But what seems to happen in language models much more often is this phenomenon of polysemanticity, where a neuron activates for a bunch of seemingly unrelated things, like Eliezer Yudkowsky and Barack Obama and list variables in Python code. And this is really annoying, because in order to do mechanistic interpretability on a model, you need to be able to decompose it into bits you can actually understand and reason about individually.

Neel: But you just can’t do that if the model is messy. You just can’t do that if there aren’t individually coherent bits. But if models are using compression schemes like superposition, neurons may not be the right units to reason about them. Instead, maybe the right unit is some linear combination of neurons. And so one thing that people might be noticing is that I’ve given this framing of features as directions, but then I’m also claiming that the model can fit in more features than it has dimensions. And you’re like, if you’ve got a thousand dimensional space, you can’t have more than a thousand orthogonal directions. But what seems to be going on is that models, in fact, use almost orthogonal directions.

Neel: You can fit in exponentially many directions that have dot product 0.1 with each other rather than 0, even though you can only fit in linearly many things with 0 dot product, because high dimensional spaces are weird, and there’s just a lot more room to squash stuff in. And so long as things are sparse, so most of these vectors are empty and don’t occur on any given input, the fact that they have non-trivial interference doesn’t really matter. There’s two things to distinguish here. There’s the input weights and the output weights. The input weights determine when the neuron activates and what it is detecting. And we can totally have a neuron that activates on Eliezer Yudkowsky and the Eiffel Tower.
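A quick numerical illustration of the “almost orthogonal directions” point, with made-up sizes: random unit vectors in a high dimensional space have pairwise dot products that cluster near zero, so you can pack far more nearly-orthogonal directions than the dimension allows for exactly orthogonal ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 1000, 10_000   # many more features than dimensions

# Random unit vectors: pairwise dot products concentrate around 0
# with a spread of roughly 1/sqrt(d_model) ~= 0.03.
W = rng.standard_normal((n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

sample = W[:200] @ W[:200].T                 # pairwise dot products of a 200-vector sample
off_diag = sample[~np.eye(200, dtype=bool)]
print(np.abs(off_diag).mean(), np.abs(off_diag).max())  # roughly 0.025 and 0.1-ish
```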

The Residual Stream: How Models Accumulate Information

Neel: And then there’s the output weights, which is like, what features does this boost in what we call the model’s residual stream, its accumulated knowledge of the input so far. And what we generally find is there would be some Eliezer Yudkowsky feature, some direction that gets boosted, and some Eiffel Tower direction that gets boosted. And the obvious problem is there’s interference. The model is now going to have non-zero information in the Yudkowsky direction and in the Eiffel Tower direction. But if it has two Yudkowsky neurons and two Eiffel Tower neurons, and only one overlaps, then when Eliezer Yudkowsky is there, it’ll get +2 in the Eliezer direction, while when the Eiffel Tower is there, it gets +1 in the Eliezer direction. And so it can tell them apart.

Michaël: How I understand it from your explanation: the residual stream is kind of this skip connection in ResNets, where you don’t do any extra computation, but you just pass the output to some other neurons, you skip a connection. And there’s a framing of the residual stream where you consider these skip connections to be the main thing going on, and the rest is extra steps. And so what you’re saying is that basically by passing to the residual stream you pass the information to some main river, the main flow of information.

Neel: Yeah, I think this is an important enough point that it’s worth a bit of attention to explain it. So people invented this idea of residual connections, where in a standard neural network, the way it works is that the input to layer n is the output of layer n-1. But people have this idea of adding skip connections. So now the input to layer n is the output of layer n-1 plus the input to layer n-1. We let the input kind of skip around with an identity. And this turns out to make models much better. And the way people always draw it is with this central stack of layers with these tiny side connections. But if you look at the norm of the vectors passed along, it is actually the case that the “tiny” skip connection is much bigger. And most circuits in the model in practice seem to often skip multiple layers. And so the way I draw models is with a big central channel called the residual stream, with tiny skips to the side for each layer that is an incremental update. The residual stream is this big shared bandwidth the model is passing between layers, that each layer is reading from and writing to as an incremental update.
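A minimal sketch of this residual-stream framing in PyTorch (the sizes and the MLP-only block are illustrative, not a full transformer): each layer reads the stream and adds an incremental update back into it.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One layer that reads from the residual stream and writes an incremental update."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        # The skip connection: the layer's output is *added* to the stream it received.
        return resid + self.mlp(resid)

d_model = 16
stream = torch.randn(d_model)   # the residual stream: shared bandwidth between layers
layers = [ResidualBlock(d_model, 64) for _ in range(4)]
for layer in layers:
    stream = layer(stream)      # each layer makes an incremental update to the same stream
```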

Superposition and interference are at the frontier of the field of mechanistic interpretability

Neel: A couple of insights about superposition. The first is that this is just very much a frontier in the field of mechanistic interpretability right now. Like, I expect my understanding of superposition is a lot more advanced than it was six months ago. I hope it will be far more advanced six months from now than it is right now. We’re just quite confused about the geometry of how models represent things internally. And I think better understanding this is probably the big open problem in the field. One thing which I’ve tried to emphasize, but want to make explicit, is that a really important fact about language is sparsity, this fact that most features are rare on most inputs. Most features, like Eliezer Yudkowsky or the Eiffel Tower, are rare. Because superposition is fundamentally a tradeoff between being able to represent more things and being able to represent them without interference.

Neel: Where Eliezer Yudkowsky and the Eiffel Tower sharing a neuron means that if either one is there, the model needs to both tell that Eliezer Yudkowsky was there and also that the Eiffel Tower is not there; this is just interference. And there’s two kinds of interference. There’s the interference you get when both things are present, what I call simultaneous interference: Eliezer Yudkowsky is in the Eiffel Tower. And there’s alternating interference, where Eliezer Yudkowsky is there but the Eiffel Tower is not, or vice versa. And if these features are rare, then basically all of the interference you get is alternating, not simultaneous. And language is just full of extremely rare features that tend not to occur at the same time.

Finding Neurons in a Haystack: Superposition Through De-Tokenization And Compound Word Detectors

Neel: A paper that one of my mentees, Wes Gurnee, worked on, called Finding Neurons in a Haystack, tried to look for empirical evidence of how models did superposition. And we found that one area where they used it a ton was these de-tokenization neurons or compound word detectors. So the input to a model is these words or tokens. But often words aren’t the right unit of analysis. The model wants to track compound words, or it has a word that gets broken up into multiple tokens. Alpaca gets tokenized as something like “ Alp”, “a”, “ca”. And clearly you want to think of this as one word.

Neel: What we found is there are neurons that seem to do a Boolean AND on common sequences of tokens to de-tokenize them, to recognize them, on things like “prime factors” or “social security” or “blood pressure”. And an important property of these is that you can never get these occurring at the same time. It is literally impossible for the current token to be the “pressure” in “blood pressure” and the “security” in “social security” at the same time, because a position just can’t be two different tokens at once. It doesn’t make any sense.

Not Being Able to Be Both Blood Pressure and Social Security Number at the Same Time Is Prime Real Estate for Superposition

Neel: I’m saying the trivial statement that the current token cannot both be the “pressure” in “blood pressure” and the “security” in “social security”. And so when models want to do this algorithm of recognizing sequences of tokens, these features can never occur at the same time. Which means that this is prime real estate for superposition. Because it’s just like, I will never have simultaneous interference. This is amazing. I can just do lossless compression. I can just have a hundred social security neurons, each of which represents another thousand compound words. And it’s so efficient. I’m in love. And in practice, this seems to be what models do. And I don’t know, I think this was a really cool paper and Wes did a fantastic job. I also think it’s kind of embarrassing that this was basically the first real case study of superposition in language models. And one thing I’m trying to work on at the moment is getting more case studies of this. Because I think that one of the main units of progress in the field of mechanistic interpretability is good detailed case studies.

Michaël: When I was watching your walkthrough, or reading the abstract, there’s something about only activating a certain part of the outputs and masking the rest. There’s some factor K or something that you change.

Neel: Yes. So in the actual paper, we’re looking into this technique called sparse probing. This was much more Wes’s influence than mine; I’m much more interested in the case studies. But so the idea of sparse probing is: we used to think that individual neurons were the right unit of analysis. But with superposition, we now think that linear combinations of neurons are the right unit of analysis. But our guess is that it’s not the case that every neuron is used to detect Eliezer Yudkowsky.

Neel: Most neurons are off, while some neurons are important here. And so we asked ourselves the question: if we trained a linear classifier to detect that Eliezer Yudkowsky is there, how sparse can it be? Where this time I mean a totally different notion of sparsity, sorry for the notation confusion. This time it’s: how many neurons does it use? And we were like, okay, it can use one neuron. I find the neuron which is most correlated with Eliezer Yudkowsky being there or not being there, and I see how good a predictor this is. Next, I take the best pair of neurons and use these, or the best quintuple or decuple of neurons. And you see how good you are at detecting the thing for different numbers of neurons. And you can use this to quantify how sparsely represented different things are.
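A toy sketch of the sparse probing idea, under stated assumptions: synthetic “neuron activations” with two planted informative neurons, a crude correlation-based selection of the top k neurons, and an ordinary logistic regression probe on just those k. The paper’s actual selection procedure is more careful; this only illustrates the “how few neurons do you need” question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_examples, n_neurons = 2000, 512

# Synthetic stand-in data: the label is "feature is present", and we plant
# the signal into neurons 7 and 19 so a sparse probe has something to find.
labels = rng.integers(0, 2, n_examples)
acts = rng.standard_normal((n_examples, n_neurons))
acts[:, 7] += 2.0 * labels
acts[:, 19] += 1.0 * labels

def sparse_probe_accuracy(acts, labels, k):
    """Rank neurons by |correlation| with the label, then probe on only the top k."""
    corr = np.array([abs(np.corrcoef(acts[:, i], labels)[0, 1]) for i in range(acts.shape[1])])
    top_k = np.argsort(corr)[-k:]
    probe = LogisticRegression(max_iter=1000).fit(acts[:, top_k], labels)
    return probe.score(acts[:, top_k], labels)

for k in (1, 2, 10, n_neurons):
    print(k, sparse_probe_accuracy(acts, labels, k))
```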

Michaël: So the more neurons you use, the more accurate you become? And how few neurons can you use and still detect it?

Neel: Yeah. So this turns out to be quite a conceptually thorny thing. So let’s take the social security example. A thing models are very good at is just storing in the residual stream information about the current token or recent tokens. It’s very, very easy to train a probe that says the current token is “security” or the previous token is “social”. And if you just train a probe to detect social security, the easiest way for this to work is that it just detects: current token is “security”, that’s a direction; previous token is “social”, that’s a direction; the sum of these two, that’s a direction.

Michaël: So in some sense, you have all these directions for each individual token being in the right order, and the mix of them is the entire group, the linear combination we were talking about before.

Neel: Yes. And like, this is boring. This is not the model detecting the compound word social security. This is just a mathematical statement about detecting linear combinations of tokens. But models do more than that. In order to detect the multi-token phrase, the model is going to intentionally have certain neurons that are activated for social security and which don’t normally activate, which it’s not going to have for, say, “social lap” or “social Johnson” or something, some nonsense combination of words. And a test you can do, which didn’t actually make it into the paper but probably should have, is to show that if you just want to detect known combinations of words, it’s a lot easier to do this than random unknown combinations of words. However, if you let the probe use every neuron, it can detect random combinations of words very easily because of this “current token is security, previous token is social” phenomenon. But because the model isn’t intentionally specializing neurons for it, it’s much harder to train a sparse probe for it.

Michaël: There’s also another thing about the other notion of sparsity, where they do these experiments where they make some features, as we were saying, more sparse or less sparse. I think maybe that’s something you can explain.

The Two Kinds Of Superposition: Computational And Representational

Neel: Yeah. So, OK, so the first thing to be clear about is there are actually two different kinds of superposition, what I call computational and representational superposition. So representational is when the model takes features it’s computed in some high dimensional space and compresses them to a low dimensional space in a way that they can be later recovered. For example, models have a vocabulary of 50,000 tokens, but the residual stream is normally about a thousand dimensions. You need to compress 50,000 directions into a thousand dimensional space, which is just a pretty big lift. You need to do a lot of compression for this to work. But you’re not doing anything new.

Neel: Your goal is just to lose as little information as possible, to find some encoding that’s convenient and thoughtful and works. And then computational superposition is when you want to compute some new features. You know the current token is “security” and the previous token is “social”, and you want to create a new feature that says this is social security, that is their combination: I can start thinking about welfare and government programs and politics and all of that stuff.

Michaël: This is very dangerous.

Neel: Apologies, I’ll try not to get you cancelled.

Michaël: It’s very dangerous if models start to understand all these things. If they start understanding politics and every very abstract concept, that means that we’re getting close to human level.

Neel: It doesn’t understand politics. It just knows that if social security is there, Trump and Obama and Biden are more likely tokens to come next. That’s the politics feature. It’s a very boring feature. I haven’t actually checked if this exists, but I’m sure it exists.

Michaël: So when you say computational feature, it means that they’re doing this to save computation?

Neel: No, what I mean is this is the algorithm learned by the model. It is useful for downstream computation to know “I am talking about social security right now and not, say, social media.” Because both have the token “social”, but they’re very different things, very different concepts, which have very different implications for what should come next. And the thing we looked at in Finding Neurons in a Haystack is computational superposition. We looked into how the model computes features like social security detection.

Neel: We also looked into a bunch of other things. We found individual neurons that seem to detect French, as in “this text is in French”, or neurons that seemed to be detecting things like “this is the end of a sentence” and stuff like that, or detecting facts.

Toy Models Of Superposition

Neel: The paper that you were asking me about, Toy Models of Superposition, this really, really good Anthropic paper, which is probably one of my all-time favorite papers, was mostly looking at representational superposition. So the point of this paper was: we want to understand why neurons are polysemantic in a real model, and we don’t know why, we’re kind of confused about this. We think it’s because they’re doing superposition, but no one’s actually seen them doing it. Can we build any setting at all where superposition is actually useful and use this to study its properties? And honestly, I would have probably predicted this would just not work because it’s too divorced from real models. And I do in fact think that using toy models cost them a lot in this area. But they also just got so much done and so many insights that I was wrong and this was a great paper. So you know, points to Chris, better researcher than me.

Neel: And so, okay, what’s going on here? What they found was they had this setup where they had an autoencoder. It had a bunch of features as inputs, where each feature was off most of the time, normally set to zero, and otherwise uniform between zero and one. It had 20 of these. And then they made it compress them into a small dimensional bottleneck, five dimensions, linearly, have a linear map back up, and then they gave it a ReLU on the end to do some cleaning up. And this is an ideal setting to test for representational superposition, because they trained it to see how well it could recover its input from its output while compressing it and decompressing it through this low dimensional bottleneck in the middle. And they found all kinds of wild results.
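A minimal training sketch of that toy setup, assuming the tied-weights form described here (compress with W, decompress with W transposed, then a ReLU); the hyperparameters are arbitrary and this only mirrors the described setup, it does not reproduce the paper exactly.

```python
import torch

n_features, d_hidden, batch = 20, 5, 1024
p_on = 0.05  # sparsity: each feature is non-zero only 5% of the time

W = torch.nn.Parameter(0.1 * torch.randn(d_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5000):
    # Each feature is usually 0; when on, it is uniform in [0, 1].
    x = torch.rand(batch, n_features) * (torch.rand(batch, n_features) < p_on)
    h = x @ W.T                     # compress 20 features into a 5-dimensional bottleneck
    x_hat = torch.relu(h @ W + b)   # linear map back up, ReLU to clean up interference
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# With enough sparsity, more than 5 feature embeddings end up with non-trivial norm,
# i.e. the model squeezes more than d_hidden features into the bottleneck.
print(W.norm(dim=0))
```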

Neel: Notably, they found that it could learn to use superposition. And it would often learn these beautiful geometric configurations, where, say, it would learn to compress nine features into six dimensions, spontaneously forming three orthogonal subspaces of two dimensions each, each of which contains three features compressed as an equilateral triangle. Or it would have five dimensions, and two of them would hold an equilateral triangle, where each feature gets two thirds of a dimension, and the other three would hold a tetrahedron, where there’s three quarters of a dimension each. And I personally would bet most of this doesn’t happen with real models, and it’s just too cute by half. But it’s also just really cool.

Neel: One insight they found that I think does generalize is that as they vary the sparsity of these features, how often they’re zero, how rare the feature is, the model becomes a lot more willing to use superposition. And the reason this is an intuitive thing is what I was saying earlier about alternating versus simultaneous interference. If a dimension contains two features, so the two features are not orthogonal to each other, then if both are there, it’s now quite hard to figure out what’s going on, because it looks like each feature is there on its own really strongly.

Neel: Models are kind of bad at dealing with this shit. But if exactly one of them is there, then in the correct feature direction it’s big, while in the incorrect feature direction, which is not orthogonal, but is also not the same, it’s small. And the model can deal with that kind of stuff; it uses the ReLU on the output to clean up. And so what’s going on is that, as you change the probability p that each feature is non-zero, because the features are sparse and independent, the probability that one is there and the other one isn’t is 2p(1 − p), and the probability that both are there is p squared. So when p is tiny, the probability they’re both there is order p squared, while the probability one of them is there is order p. And as p gets tiny, p squared gets smaller much faster than p does. So the cost of simultaneous interference becomes trivial.
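The arithmetic behind that, spelled out for two independent features that each occur with probability p:

```python
# Two independent features, each present with probability p.
p = 0.01
p_both = p * p                    # simultaneous interference: both present at once
p_exactly_one = 2 * p * (1 - p)   # alternating interference: exactly one present

print(p_both)                     # 0.0001
print(p_exactly_one)              # 0.0198
print(p_exactly_one / p_both)     # ~198: as p shrinks, alternating interference dominates
```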

Michaël: Right, so the cost of interference is very small because of this quadratic cost, and p is smaller than 1. And when something is very close to zero, so Yudkowsky appearing one in a billion times, it’s trivial to make the neuron detect both Yudkowsky and Neel Nanda, because they never happen at the same time. Or at least maybe in this podcast they happen all the time together.

Neel: Yeah, see, Neel Nanda and Eliezer Yudkowsky is actually a pretty bad… Any example I can think of is de facto a bad example. But you could imagine, I don’t know, some niche pair, like some contestant on MasterChef Season 17 and Eliezer Yudkowsky: probably never going to co-occur apart from literally that sentence.

Michaël: And so when the model has a bunch of features to take into account and they’re all kind of rare, it forms these beautiful geometric structures that are not completely orthogonal, but more like, as you said, some Platonic shapes.

Neel: Yeah, tetrahedra, and you have square antiprisms, where you have eight features in three dimensions, which is really cute.

Michaël: My main criticism is that this happens in mostly toy models, right? A few layers, MLPs or transformers.

Neel: Oh no, no, much toyer than that. Just linear map to small dimension, linear map to big dimension, single layers. That is the model they started with.

Michaël: And so the goal of your paper was to have something… to test it on real models?

Neel: Yep. I think I’d probably just close by reiterating that I think this is just probably the biggest frontier in mechanistic interpretability right now. We just don’t know how superposition works, and it’s kind of embarrassing, and it would be pretty good if we understood it way better. And I would love people to go do things like building on all the work we did in the Neurons in a Haystack paper. Go and try to understand what superposition looks like in practice. Can you erase a model’s memory of the phrase social security? How many neurons do you have to delete to do that? I don’t know.

SERI MATS: The Origin Story Behind Toy Models Of Universality

How Mentoring Nine People at Once Through SERI MATS Helped Neel’s Research

Michaël: And I think it’s the right moment to maybe talk about doing research on superposition as a whole, because I think today you can work on your own, work with Anthropic, or work with Neel Nanda through this new opportunity, the SERI MATS scholarship, and be one of the mentees you have right now. Maybe explain quickly what SERI MATS is and why you have seven people you mentor?

Neel: Nine. Thank you very much.

Michaël: Because I think Wes… I met Wes, who was working on this paper, in December with another batch of people working with you. And right now I’ve met other people that you work with as part of another batch. I think Arthur is releasing a new paper. And I also talked to Bilal about a paper he presented, I think one of the first SERI MATS papers, which he presented at some conferences and which he also did with you. So yeah, maybe talk about SERI MATS as a whole and how you work with your mentees.

Neel: I should clarify, Arthur’s paper was nothing to do with me and I can claim no credit. But he is currently one of my MATS Scholars and I can totally claim credit for the paper we’re about to put out. It’s going to be great. Better than his previous one because it has me on it this time. But yeah, so SERI MATS is this organization who were like, “Hmm, it sure seems there’s a ton of really talented people who want to do alignment work and a bunch of alignment people who would mentor people if someone made them.” Or someone was like, “Here are 10 smart people to go mentor.” But where this isn’t happening on its own, I think Evan Hubinger, Victor Warp, and Oliver Zhang were some of the main people who tried to make this happen initially with Evan as the original mentor. And Evan is a machine who was like, “Yep, I can mentor seven people. This is fine. I’ll spend my Fridays on it. This is chill. And just get shit done.” And this is now one of the biggest programs for alignment internships, I think, out there.

Michaël: To be clear, for people who don’t know what SERI MATS stands for, SERI is this Stanford organization, the Stanford Existential Risks Initiative. And then MATS is Machine Alignment Theory Scholars or something?

Neel: Yeah, something like that. Intern isn’t really the right frame for it. It’s more like, you’ll go and do independent research under the guidance of a mentor. My system is I’m a fairly, but not incredibly, hands-off mentor. I’m excited about and invested in the research a scholar produces, I have check-ins with them once a week, and I generally try to be engaged in their projects. If they’re blocked, I try to help them get unblocked. I try to help provide concrete experiment ideas and motivation and some amount of guidance, and just try to make it less of a horrifying experience than doing independent research alone. And one thing I’ve just been really pleasantly surprised by is how many great people there are out there who want to do mechanistic interpretability research and how time-efficient mentoring is, where, I don’t know, it just feels like great research happens with two hours a week from me per project. And there’s just a bunch of really competent people who mostly execute autonomously, yet I’m actually adding significant value by providing guidance and an outside perspective and mentorship and connections. I think it’s just a really cool thing that MATS facilitates.

Michaël: It’s like having a PhD supervisor that actually cares about you and is actually fun and actually is interested in your work.

Neel: Thank you. I like to think that I am better than the average PhD supervisor. It’s a low bar, so I feel I probably meet this. But yeah, one thing I didn’t really expect going into this is that I think it’s just been really good for my career to do a lot of mentoring, because I’m just learning really useful skills on how to advise research, how to lead a team, how to generate a bunch of ideas, how to help other people be more effective rather than just doing things myself. And one thing I’m currently trying to figure out is taking more of a leading role on the DeepMind mechanistic interpretability team. And I think I’m just in a much, much better position from having spent the past, I don’t know, coming on a year doing a bunch of mentoring in my spare time. And also good papers happen. It’s just such a good deal. I don’t know why more people don’t do it. I also have the hypothesis that I’m just really extroverted in a way that gives me the superpower of being able to just casually have nine mentees in the evenings and just chill and geek out about cool projects happening.

Michaël: I think the superpower is that you gain energy from talking to people, right? And so you enjoy it, you’re recharged in the evening. And some people have told me that, compared to other mentors, you can just have this one-hour call with Neel Nanda at the end of the day and it becomes two hours because you just talk. You kind of enjoy doing this. It’s not even a time commitment. You just actually enjoy helping people.

The Backstory Behind Toy Models of Universality

Michaël: One person that I think is kind of a good example of this is Bilal. So the paper we’ve talked about, I think it’s toy models of universality, one of the first SERI MATS papers. I was at ICML in Hawaii and I recorded Bilal doing this presentation. And the only logo on the paper was SERI MATS. And the authors were Neel Nanda and Bilal, who were, I believe, independent at the time.

Neel: Lawrence Chan was also on there. I can’t remember what he put. I think he might’ve put UC Berkeley because he used to be a UC Berkeley PhD student. But he’s now at ARC evals and used to be at Redwood and does all kinds of random shit.

Michaël: And apparently, I guess the idea for this paper came from an SF party, if I remember what you said elsewhere. What’s the main idea here?

Neel: So the backstory of the paper is, so there was this paper called Grokking, about this weird-ass phenomenon where you train some tiny models on an algorithmic task, like modular addition. And you find that it initially just memorizes the training data. But then if you keep training it for a really long time on the same data, it will abruptly generalize, or grok, and go from can’t do the task to can do it. And it’s very cool and weird. And this was a really popular paper because people were just like, what the fuck? Why does this happen?

Neel: We know that things can memorize. We know that they can generalize, but normally it just does one and sticks there. It doesn’t switch. What’s going on? And there was this great story that the reason they found it is they trained a model to do it. It failed. But then they just left it training over the weekend. And when they got back, it had figured it out. And I don’t know if this is true, but it’s a great story. So I sure hope it’s true. And the paper I discussed earlier about modular addition, Progress Measures for Grokking via Mechanistic Interpretability: the seed of this was I saw the grokking paper and I was like, these are tiny models on clean algorithmic tasks. If there was ever a mystery that someone made to be mech-interpreted, it was this one. And the algorithm I found generalized a fair bit. It covers modular subtraction and multiplication and division, which were some of the other tasks in the paper.

Neel: But there was this whole other family of tasks about composition of permutations of sets of five elements, composition of the group S5. And this was completely different, and I had no idea how this happened. And I was at a party and I raised this to some people around me as a puzzle. And two people there, Sam Box and Joe Benton, were interested. And they first came up with the idea that representation theory was probably involved, which is this branch of 19th and 20th century mathematics about understanding groups in terms of how they correspond to sets of linear transformations of vector spaces.

Neel: After the party, Sam actually sent me a LessWrong message with the first draft of the algorithm we ended up concluding the model learnt. So there’s since been some further research that suggests the algorithm might have just been completely wrong. I don’t really know what’s up with that. I’m leaving it up to Bilal to go figure out whether this is legit and tell me about it, because I haven’t got around to actually reading the claimed rebuttal yet. But kind of embarrassing if we just significantly misunderstood the algorithm. But whatever, science, people falsify it. Progress marches on.

From Modular Addition To Permutation Groups

Neel: So yeah, all groups have these things called representations. And for example, for the permutation group on five elements, you can take the four dimensional tetrahedron, which has five vertices, and any linear map that maps the tetrahedron to itself, rotations and reflections, permutes the vertices. And there’s actually an exact correspondence: for any permutation, there’s some linear map that does it, and vice versa. So you can actually think about group composition on these permutations of five elements as being linear transformations on this four dimensional tetrahedron, which is a four by four matrix. And what we seem to find is that the model would internally represent these matrices. Though this is kind of awkward to talk about, because apparently there was a fairly compelling rebuttal that I haven’t engaged with yet. So maybe we didn’t show this. Who knows? Interpretability is hard, man.

Michaël: Yeah, I’m sorry if this is wrong and we expose you on a podcast about it. But I guess the main idea is that somehow you can map things between how different groups in mathematics are isomorphic, or map between permutations of five elements and linear maps of a tetrahedron. And you can find a cute or nice way of looking at this. And this maps exactly to modular addition, right?

Neel: Yes, with modular addition, the representations are rotations of an n-sided shape. Adding five mod seven is equivalent to rotating a seven-sided thing by five sevenths of a full turn.

Michaël: I’m kind of curious about the thing we discussed before with the sines and cosines and all the mathematics where you decompose what the model was doing. So is the model doing some kind of computation that is similar to sines and cosines, and at the same time has this different mapping with the group of permutations as well?

Neel: No. So the sines and cosines are the group representation. In the case of modular addition, the group representation is rotations of an n-sided shape, which is the same as rotations of the unit circle. And the way you represent a rotation is with sines and cosines. And it turns out that the algorithm I found about composing rotations was actually an algorithm about composing group representations. That just happens to also have this form of Fourier transforms and trig identities in the simple case of modular addition, which is in some sense the simplest possible group.
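A small numpy sketch of the rotation picture, with an arbitrary modulus and frequency: represent each residue as a rotation matrix, and composing the rotations for a and b gives exactly the rotation for (a + b) mod p. This illustrates the mathematical identity the algorithm exploits, not the model’s learned weights.

```python
import numpy as np

p, omega = 113, 7                    # modulus and an arbitrary frequency
theta = 2 * np.pi * omega / p

def rotation(a: int) -> np.ndarray:
    """The representation of residue a: rotate the unit circle by a * theta."""
    ang = a * theta
    return np.array([[np.cos(ang), -np.sin(ang)],
                     [np.sin(ang),  np.cos(ang)]])

a, b = 41, 99
composed = rotation(a) @ rotation(b)   # composing rotations adds the angles
target = rotation((a + b) % p)         # the representation of (a + b) mod p
print(np.allclose(composed, target))   # True: full turns of 2*pi drop out, so this is addition mod p
```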

Michaël: Right. So this paper is a more general framing than the actual decomposition into cosines and sines.

Neel: Yes. We found a generalization of the algorithm, which we thought we showed the model learnt. Maybe we were wrong. Who knows?

The Model Needs To Learn Modular Addition On A Finite Number Of Token Inputs

Michaël: And there’s something else you say about this, which I think is interesting. We tend to think the model needs to learn sines and cosines, these very complex functions, but actually it only needs to learn the correct answer on a finite number of token inputs.

Neel: Yeah. The input the model receives is just two numbers, A and B, where each of them is an integer between 0 and 112. Because I did addition mod 113, because that was just a random number I picked that was prime, because primes are nice. And so it needs to know what sine of A times some frequency is. But because A can only take on 113 values, it only needs to memorize 113 values of the sine function. And this is very easy. Models literally have a thing called the embedding that is just a lookup table. For people who are familiar with the idea of one-hot encodings, the idea is that the input is one-hot encoded and then multiplied by a matrix, which is equivalent to a lookup table. And it’s not that it knows what sine of 0.13 is, and that’s different from sine of 0.14; it just knows its value on the 113 possible inputs it sees for A and the 113 possible inputs it sees for B. Because it just cannot ever see anything else.
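A tiny numpy sketch of the “embedding is just a lookup table” point, with made-up sizes: multiplying a one-hot vector by the embedding matrix selects a single row, so the model only ever needs the 113 stored rows, whatever values (sines, cosines, anything else) are baked into them.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 113, 128
W_E = rng.standard_normal((vocab, d_model))   # embedding matrix: one row per possible input

a = 42
one_hot = np.zeros(vocab)
one_hot[a] = 1.0

# One-hot times the embedding matrix is exactly a row lookup.
print(np.allclose(one_hot @ W_E, W_E[a]))     # True
```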

Michaël: There’s an embedding with 113 values, and then whenever it needs to look at a value, it can just do a one-hot dot product or something like that?

Neel: Yeah, exactly. And this is all baked into the model. One way to think about it is that there is some underlying ground truth representation of the sine wave, which is the regime where the model performs the algorithm properly. And we just give it 113 data points and say, “Smush them to be on this underlying sine wave.” And the model does a bunch of trial and error until “Okay, this is about the right point.” And it needs to do this 113 times. But it’s not that it’s learned to do the real valued computation of the actual wave, which is a much harder task. And people often are just like, “Oh my God, it learned how to do trigonometry.” And no, it memorized 113 numbers. Not that hard, man.

Michaël: So to some extent, it’s doing less reasoning than we think. It’s just doing literally interpolation on the sine curve or something?

Neel: Yeah, it’s a one-layer model. The impressive thing is that it realized the sine curve was useful. Once you… Not that it had the capacity to learn it. They can learn arbitrary functions. It’s a lookup table. It can do anything. But yeah.

Why Is The Paper Called Toy Model Of Universality

Neel: The actual narrative of Bilal’s paper, the reason we called it a toy model of universality, is that one really interesting thing happens once you have these representations. Because the representations you get, there are actually multiple of them for each group that are qualitatively different. You can get rotations of different frequencies, which are just fundamentally different from each other. No, that’s a terrible example. So with the permutation group, there are these linear transformations of the four-dimensional tetrahedron. But there’s also a bunch of other stuff. I think there’s a transformation of the main diagonals of a dodecahedron or an icosahedron, for example. And that’s just a totally different group. Or that might be the alternating group. I don’t know, man. Geometries are weird. And these are just qualitatively different algorithms the model can learn. And so an interesting question is, which one does it learn?

Neel: There’s this hypothesis called universality that says that there are underlying true things that models learn, that they will systematically converge on. And what we found here is that there’s actually a ton of randomness. As you just vary the random seed, the model will learn multiple of these representations, but it will also learn different things each time. And this is just kind of weird. And you would have guessed that it would learn the simple ones. And naively, you would think, oh, things which are three-dimensional shapes are easier to learn than things that are four-dimensional shapes, so obviously it will learn the 3D ones rather than the 4D ones. And what we find is there’s a little bit of a bias towards lower dimensional stuff, but it doesn’t really correspond to our human intuitions.

Neel: It’s kind of weird. And we don’t really have a good story for why. But importantly, it’s both not uniform, it prefers to learn certain representations over others, and it’s also not deterministic. And, I don’t know, I don’t think anyone actually believed this, but there was this strong universality hypothesis that models would always learn exactly the same algorithms. And I think we’ve just clearly disproven this, at least with these small models. Though the exciting thing is that there’s a finite set of algorithms that can be learned. And you can imagine in real life, without having access to this ground truth, learning some periodic table of algorithms, where you go and understand how a model learns something, interpret five different versions of it on five different random seeds, and learn a complete set of all algorithms the model could learn.

Michaël: So basically the model has different algorithms it can learn. And it doesn’t learn always the same, but there’s a set of things it can learn that it’s able to learn.

Neel: Yeah. Here’s a hypothesis you could have. We looked at a toy model. This is not that much evidence. I think it’d be pretty cool if someone went and did this properly.

Michaël: So the Bilal paper about the S5 group is part of the grokking research you’ve done. And one paper I think is one of the most famous papers you’ve done is “Progress measures for grokking via mechanistic interpretability,” which also has a walkthrough on your channel.

Neel: Yes. I highly recommend it. It’s a great walkthrough.

Michaël: For people who don’t have four hours to listen to it…

Neel: There are people who don’t have four hours? What kind of viewership do you have, man? But yes, I should clarify, because Bilal will kill me: his paper was not just on S5. His paper was a grand ambitious paper about many groups, of which S5 was one particularly photogenic example.

Progress Measures For Grokking Via Mechanistic Interpretability, Circuit Formation

Neel: But yes, so what happened in this “Progress measures for grokking” paper? So this is the one where I reverse engineered modular addition, which I’ve already somewhat discussed. We found that you could just actually reverse engineer the algorithm the model had learned, and it had learned to think about modular addition in terms of rotations around the unit circle. And in my opinion, the story of this paper and the reason it was a big deal and that I’m proud of it is… sorry, those are two different things.

Neel: The reason I’m proud of the paper is that lots of people think that real interpretability is bullshit. They’re just like, “Ah, you can’t understand things. It’s all an inscrutable black box. You have hubris for trying and should give up and go home.” And whatever random crap people talk about nowadays. I try to stop listening to the haters. And the thing…

Michaël: Neel Nanda, August 3rd: “Don’t listen to the haters”

Neel: I’m pro listening to good criticism of specific interpretability work, to be clear. Criticism is great. And also, there’s lots of interpretability work which is kind of bad. So that I’m pretty in favor of. But yeah, it’s kind of like… I think I just very rigorously reverse-engineered a non-trivial algorithm. I went in not knowing what the algorithm would be, but I figured it out by messing around with the model. And I think that is just a really cool result that I’m really glad I did, and a good proof of concept that the ambitious mechanistic interpretability agenda is even remotely possible.

Neel: The second thing was trying to use this understanding to explain why grokking happened. And so as a reminder, grokking is this phenomenon where the model initially memorizes the data and generalizes terribly. But then when you keep training it on the same data again and again, it abruptly generalizes. And what I found is that grokking was actually an illusion. It’s not that the model suddenly generalizes. Grokking actually splits into three discrete phases that we call memorization, circuit formation, and clean-up. In the first phase, the model memorizes; it does what it says on the tin. But then there’s this… this is not going to transfer well over audio, but whatever.

Neel: If you look at the grokking loss curve, its train loss goes down and then stays down, while test loss goes up a bit, to worse than random, and it remains up for a while. And it’s during this seeming plateau, which I call circuit formation, that it turns out the model is actually transitioning from the memorizing solution to the generalizing solution, somehow keeping train performance fixed throughout. And it’s kind of wild that models can do this.

Neel: The reason it does this, and I don’t claim this is fully rigorously shown in the paper, this is just my guess, is that there’s something weird going on where it’s easier to get to the region of the loss landscape where the model is doing the thing by memorization than the region where it’s doing it by generalization. But we’re training the model with weight decay, which creates an incentive to be simpler, which creates an incentive to do it more simply.

Neel: This means that the model initially starts memorizing because it’s easier to get to, but it wants to be generalizing. And it turns out it is possible for it to transition between the two while preserving train performance, which is kind of surprising a priori, but in hindsight it’s not that crazy. And then, why does test loss crash rather than going down gradually? So this is the third stage, called clean-up. During circuit formation, the model is still mostly memorizing, and memorization generalizes really badly out of distribution, which means that the model just performs terribly on the unseen data. And it’s only when it’s got so good at generalizing that it no longer needs the parameters it’s spending on memorizing that it can do clean-up and get rid of those parameters. And it’s only when it’s done that that the model performs better, that it’s actually able to perform well on the data it hasn’t seen yet, which is this sudden grokking crash, or spike. And this is not sudden generalization, it’s gradual generalization followed by sudden clean-up.

Michaël: Do we have any evidence for this circuit formation that happens gradually? Have you tried to look at the circuits and see if they could solve simpler tasks?

Neel: Yeah, so this is the point of our paper. The most compelling metric is what I call excluded loss. This is a special metric we designed using our understanding of the circuit, where we delete the model’s ability to use the rotation-based algorithm, but we keep everything else the same. And what we find is that early on in training, excluded loss is perfect; it’s about as good as training loss. But as time goes on, during circuit formation, excluded loss diverges until it’s worse than random, even though training loss is extremely good the whole way.
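The exact excluded loss metric from the paper isn’t reproduced here, but here is a rough sketch of the flavor of intervention, under the assumption that you have already identified which directions in logit space the rotation-based circuit writes to: project those directions out of the logits and re-measure the loss. The variable names in the usage comments are hypothetical.

```python
import numpy as np

def ablate_directions(logits: np.ndarray, directions: list[np.ndarray]) -> np.ndarray:
    """Remove the component of the logits lying in the span of the given directions
    (a crude stand-in for deleting the rotation-based circuit's contribution)."""
    Q, _ = np.linalg.qr(np.stack(directions, axis=1))  # orthonormal basis for those directions
    return logits - logits @ Q @ Q.T

# Usage sketch: `logits` has shape (batch, vocab); `circuit_dirs` would be the
# (hypothetical, already-identified) key Fourier directions in logit space.
# ablated = ablate_directions(logits, circuit_dirs)
# excluded_loss = cross_entropy(ablated, answers)   # recompute training loss on ablated logits
```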

Advice on How To Get Started With Mechanistic Interpretability And How It Relates To Alignment

Getting Started In Mechanistic Interpretability And Which Walkthroughs To Start With

Michaël: And so for people who want to work with you on SERI MATS projects or collaborate on research, or who want to learn more about mechanistic interpretability, do you have any general direction you would recommend people go in?

Neel: Yeah, so I have this blog post called “Getting Started in Mechanistic Interpretability.” You can find it at neelnanda.io/getting-started. That’s basically just a concrete guide on how to get started in the field. Much of what I have people do during the first month of SERI MATS is just going through that blog post, and I think you can just get started now. There’s a lot of pretty great resources on the internet at this point on how to get into mechanistic interpretability, a lot of which is because I was annoyed at how bad the resources were, so I decided to make good ones. And I think I succeeded. You are welcome to send me emails complaining about how much my resources suck and how I shouldn’t do false advertising on podcasts. And yeah, I don’t know how much longer I’m going to continue having slack on the side of my job to take on MATS scholars. I’m hoping to get at least another round of MATS scholars in, which I guess would be, I don’t know, maybe about two cohorts a year. I don’t know exactly when the next one’s going to be, but just pay attention to whatever MATS next advertises.

Michaël: And yeah, I guess for your YouTube work, because this is probably going to be on YouTube, do you have any video or intro that you recommend people watching? After this podcast, what should they start their binge on?

Neel: Yeah, so I think probably the most unique content I have on my channel is my research walkthroughs, where I just record myself doing research and upload it. And I think, I don’t know, I’m very satisfied with this format. I feel like it just works well. It’s kind of fun and motivating for me. And you just don’t really normally see how the sausage gets made. You see papers, which are this polished, albeit often kind of garbage, final product that’s like: “here is the end thing of the research”. But if you’re getting into the field, the actual skill is how to do it. And I think watching me and the decisions I make is educational. I’ve got pretty good feedback on them. My second most popular video is, what, also my second ever video? It’s all gone downhill since then, man. It’s just a recording of myself doing that. I, as I mentioned, have 16 hours of additional recordings that I did talking with scholars that I’ll be uploading over the next few weeks. And there’s a long marathon one about looking into how GPT-J learns to do arithmetic. That’s a 6 billion parameter language model.

Michaël: Yeah. I’m really excited to have this GPT-J walkthrough. And since I’m in Daniel Filan’s house, I thought of this Daniel Filan question: what is a great question I haven’t asked you yet, or that I forgot to ask you?

Neel: Let’s see.

Why Does Mechanistic Interpretability Matter From an Alignment Perspective

Neel: You haven’t at all asked me why does mechanistic interpretability matter from an alignment perspective?

Michaël: Assume that I asked and that you have a short answer?

Neel: I kind of want to give the countercultural answer of like, I don’t know, man, theories of change, backchaining, it’s all really overrated. You should just do good science and assume good things will happen. Which I feel is an underrated perspective in alignment.

Neel: My actual answer is, here’s a bunch of different theories of change for interpretability at a very high level. I don’t know, man, we’re trying to make killer black boxes that we don’t understand. They’re going to take over the world. It sure seems like if they weren’t black boxes, I’d feel better about this. And it seems like one of the biggest advantages we have over AI is we can just look inside their heads and see what they’re doing and be like, the evil neuron is activating. Deactivate. Cool. Alignment solved. And I don’t actually think that’s going to work.

Neel: But it just seems like if we could understand these systems, it would be so much better. Some specific angles where I think interpretability seems particularly important: one I’m really excited about is auditing systems for deception. So fundamentally, alignment is a set of claims about the internal algorithms implemented by a model. You need to be able to distinguish an aligned model that is doing the right thing from a model that has just learned to tell you what you want to hear. But a sufficiently capable model has an instrumental incentive to tell you what you want to hear in a way that looks exactly the same as an aligned model. And the only difference is in the internal algorithm. So it seems to me there will eventually be a point where the only way to tell if a system is aligned or not is by actually going and interpreting it and trying to understand what’s going on inside. That’s probably the angle I’m most bullish on, and the world where I’m most like, man, if we don’t have interp, we’re just kind of screwed. But there’s a bunch of other angles.

How Detecting Deception With Mechanistic Interpretability Compares to Collin Burns’ Work

Michaël: For the deception angle, I had Collin Burns on in December to talk about his Contrast-Consistent Search and deception work. Not deceptive work, but how to detect deception. And he was basically saying, “oh man, if we had those linear probes or whatever, then whenever the model is lying, it will say, ‘oh, I am lying. I am trying to deceive you,’ or something.” Or with the model being fully honest, you can ask it, “hey, are you lying right now?” and the model would say “yes, I am lying.” He was pretty bullish on things going well at that point.

Neel: Yeah. And I would put Collin’s work in a fairly different category. I kind of see mechanistic interpretability in the really ambitious “big if true” category, where we’re pursuing this really ambitious bet that it is possible to really understand the system. And like, this is fucking difficult and we’re not very good at it. And it might be completely impossible. Or we might need to settle for much less compelling, much weirder and jankier shit. And I put Collin’s work in the category of, I don’t know, kind of dumb shit that you try because it would be really embarrassing if this worked and you didn’t do it. But like, you don’t have this really detailed story of how the linear probe you train works or how it tracks the thing that you’re trying to track. It’s very far from foolproof, but it seems to genuinely tell you something. And it seems better than nothing. And it’s extremely easy and scalable and efficient. And to me, these are just conceptually quite different approaches to research. I’m not trying to put a value judgment on it, but I think this is a useful mental framework for a viewer to have.

Michaël: One is “big if true”: if we understand everything, then we’re pretty much saved. And one is pretty useful and easy to implement right now, but maybe has some problems.

Neel: Yeah. Mechanistic interpretability has problems. It has so many problems. We suck at it and we don’t know if it’ll work. But yeah, it’s aiming for a rich, detailed mechanistic understanding rather than “here’s something which is kind of useful, but I don’t quite know how to interpret it.”

Michaël: I think we can say this about the podcast as well. It seems quite useful, but I’m not sure how to interpret it.

Neel: I’m flattered.

Final Words From Neel

Michaël: I don’t have much more to say. If you have any last message for the audience, you can go for it. But otherwise, I think it was great to have you.

Neel: Yeah, I think probably I’d want to try to finish off the thoughts on why mechanistic interpretability matters from my perspective. I think one of the big things to keep in mind is just there’s so many ways that this stuff matters. You could build mechanistic interpretability tools to give to your human feedback raters so they can give models feedback on whether it did the right thing for the right reasons or for the wrong reasons. You could create demos of misalignment. If we’re in a world where alignment’s really hard, it seems really useful to have scary demos of misalignment.

Neel: We can show policymakers and other labs and be like, “This thing looks aligned, but it’s not. This could be you. Be careful, kids, and don’t do drugs,” and stuff like that. It seems pretty useful for things like understanding whether a system has situational awareness or other kinds of alignment-relevant capabilities. I don’t know. You don’t need full, ambitious mechanistic interpretability. It’s entirely plausible to me that if any given one of these was your true priority, you would not prioritize this fucking blue-skies mechanistic interpretability research. I think that it’s worth a shot. It seems like we’re having traction. I also think it’s really hard. We might just completely fail. Anyone who’s counting on us to solve it should please not, man. Day to day, I generally think about it as, “It would be such a big deal if we solved this. I’m going to focus more on the down-to-earth scientific problems of superposition and causal interventions and how to do principled science here, and accept that things kind of suck and that I don’t necessarily need to be thinking super hard about a detailed theory of change.” Because getting better at understanding the killer black boxes just seems so useful.

Michaël: For me, this sounds like a great conclusion. I think if we solve the ambitious version of mechanistic interpretability, we can go to governments and show exactly what happens. We can pin down exactly what we need to know to align these models. I think it’s a very important step, at least.

Neel: It would be so great if we solve the ambitious version of mechanistic interpretability. I would just be so happy. Such a win.

Michaël: If you want to make Neel Nanda happy, solve mechanistic interpretability, go check out his YouTube channel, check out his exercises, check out his papers, figure out if the Bilal paper is true or not.

Neel: Thank you very much. No worries. It’s great being on. If people want to go check out one thing, you should check out my guide on how to get started, neelnanda.io/getting-started.

Michaël: To get started, check out get started.

Neel: It’s well-named, man.