Jesse Hoogland on Developmental Interpretability and Singular Learning Theory

Jesse Hoogland is a research assistant at David Krueger’s lab in Cambridge studying AI Safety. Before he came to grips with existential risk from AI, he co-founded a healthtech startup automating bariatric surgery patient journeys.

More recently, Jesse has been thinking about Singular Learning Theory and Developmental Interpretability, which we discuss in this episode.

(Note: this episode was recorded outside, without any preparation, so this conversation is a bit less structured than usual. The audio is also at times noisy because of the wind. And, as always, you can click on any sub-topic of your liking in the outline below and then come back to the outline by clicking on the green arrow)



Jesse: Hello YouTube, Michaël here. Today I’m here with Jesse to talk about developmental interpretability and interpreting God’s programming language.

Michaël: I’m sorry, I’m Michaël Trazzi and you’re on The Inside View. Today is day 4 of the vlog and we’re at Alamo Square Park in San Francisco. It is currently very windy and we’re very cold, so we’re trying to be quick here. So here is the actual Jesse Hoogland. What is interpretability and why are you interested in it?

Jesse: Interpretability is about understanding what’s going on inside of neural networks. Why do we care to understand what’s going on inside of neural networks? Well, because they might want to kill us. And if we can understand what’s going on inside of them, we might be able to stop that.

Why Is Jesse Concerned About AI Risk

Basic AI risk arguments, why care about AI risk at all

Michaël: I don’t understand. I’m a simple AI user. I use ChatGPT every day and it never tries to kill me. It’s just being very nice and very useful to me. And I heard that you can just give it instructions. You can just tell it if it’s being good or bad, and in the end it just learns how to be very useful to humans. So if we just scale those models up and tell them to be useful to humans, why would they want to kill us? Why would they behave differently from the fine-tuning data?

Jesse: So maybe we just keep making these systems bigger and they just end up being friendly and caring about us. On the other hand, what about humans? Humans, you make them bigger and smarter and faster. And do we care more about chimpanzees now? Do we care more about robots?

Michaël: I think there are more and more people who care about animal suffering. Fewer and fewer people are eating meat. I think some people would say that being more ethical is some sort of convergence point in human morality, because people tend to become more altruistic as they grow older or learn more about the world.

Jesse: I think the important reason to be concerned about future AI systems, even if current AI systems seem reasonable, is that power decreases the margin for error. The more power you inject into a system, the more capable it becomes. It turns out that if it’s only slightly misaligned with us, only slightly different from us in values, that can have big consequences. So we have to be very careful when we get to very large, powerful systems.

Michaël: If I’m someone who believes that technology will mostly be good for humanity, I would ask: why would we want AI to be aligned with our values? Why can’t AIs just have their own values? Maybe human values are just too hard to understand. Why can’t we have a very good ChatGPT, a very useful tool that helps humans while having different values, but is still just useful?

Jesse: So I see you want me to give the entire AI X-risk intro. Let’s start at the beginning.

Michaël: No, no, no, no, no. We don’t need to start at the beginning. It’s just that when I hear someone say we need to align our AIs to our values, and that otherwise, if they’re even slightly misaligned, they’re very dangerous… I can see a lot of people watching this and thinking, “No”, or “Why?”, “Do we need it to understand human values?”, “Why would it start wanting to kill us?”. I think those questions are valid in some sense.

Jesse: I think they’re valid too. I don’t think it’s obvious. I’m not one of those total doomers who think there’s a 99% chance of doom.

Michaël: So what’s the actual probability?

Jesse: On a good day, like coin flip odds, maybe 60% odds of doom.

Michaël: On a bad day?

Jesse: 80% plus.

Jesse’s Story And Probability Of Doom

Michaël: How do you wake up in the morning and get out of bed knowing that there’s possibly like an 80% chance of every single human you’ve ever met dying?

Jesse: I’m just your average guy. I’m unconscious most of the day long, working, grinding away.

Michaël: So you work unconscious?

Jesse: Just like anybody else really. No, but in reality I think it’s gotten a lot easier since I actually started working on AI safety.

Michaël: When was that?

Jesse: Probably about a year ago that I decided to do this full time.

Michaël: What was the turning point?

Jesse: I was working on a health tech startup. I had started a company. We were automating bariatric surgery patient journeys. At some point I realized that I did not care about it at all.

Michaël: You didn’t care about the health of other fellow humans?

Jesse: That’s a charitable interpretation. Let me rephrase: I did not care about the kinds of problems I was dealing with. The stupid software engineering, the mindless web development that was going into it. I wasn’t cut out for it. That, combined with the fact that I’d been growing ever more concerned about AI, at some point led to a tipping point. I knew I had to do something.

Michaël: You already knew about AI alignment or AI safety at the time. When was the first time you heard about it?

Jesse: Many years ago. Back in 2015, when I read Superintelligence by Nick Bostrom. At the time I just thought, “Oh, this is very far off in the future. Not immediately relevant in my lifetime.”

Michaël: So you thought, this is someone else’s problem, not mine?

Jesse: Someone else’s problem or maybe I can’t contribute much to this yet. Or maybe I saw that a lot of the kinds of problems people were thinking about were not problems I was familiar with.

Michaël: I heard you were a physicist before; you studied physics. So that’s mostly your background: you learned physics, then you worked for this health tech startup, and then you started doing AI. Is that correct, or are there other lives, other stories?

Jesse: Those are the main ones. And the rest, I’m going to leave a mystery.

Michaël: But now I’m curious about the present moment. What are you currently interested in and what do you see as a promising path towards helping out with alignment?

Singular Learning Theory

How Jesse Got Into Singular Learning Theory

Jesse: I think there are many things we should be doing and I’m excited about many things. But for myself, I’m most excited about bringing ideas from physics into making sense of what’s going on inside of neural networks. So that’s where this singular learning theory comes from. You can see it as a kind of application of thermodynamics to neural networks.

Michaël: Singular learning theory? You’re saying it as if I knew everything about it.

Jesse: I think the best way to think about singular learning theory is that it’s something like the thermodynamics of learning. Or maybe the statistical physics of learning. It takes a bunch of ideas we understand pretty well from physics and applies them to neural networks to make sense of them. It’s a relatively new field, about two decades old. Essentially all invented by one brilliant guy in Japan, Sumio Watanabe, who saw the ingredients, clicked them into place, and started applying them to neural networks and other systems like that.

Michaël: That was a few years ago or is that still going on? When did he do that?

Jesse: This was more than a decade ago that he saw the ingredients that would lead to singular learning theory. It’s actually a very general theory that applies to many other systems. Only recently have people started to see that it could be relevant for alignment.

Michaël: Were you one of those people? Was there someone else who saw the link between this and alignment?

Jesse: I was first introduced to the subject by Alexander Gietelink Oldenziel.

Michaël: It’s a very long name.

Jesse: It is a very long name. We were coming back from some conference and I asked him, “Alex, what do you think are the two or three most important research directions within AI safety?” After thinking about it for a second, he looks at me and says, “Fine, you’re not going to be interested in this anyway, but it’s singular learning theory. And then this one other thing: computational mechanics, epsilon machines.” The way he phrased it, as if I wasn’t going to be interested, obviously sparked my interest.

Michaël: You took this as a challenge and you spent a week trying to learn it.

Jesse: Right. He sent me this thesis and I tried to read it. I recognized some of the words, but it was just: what the hell is going on here? After a second pass and a third pass, some things started to make sense. It fit into things I’d learned during my physics master’s.

Intuition behind SLT: the loss landscape

Michaël: If I’m a random YouTube subscriber watching this video right now, I have the sense that it’s something linked to thermodynamics or physics and something with neural networks. If you had a machine learning engineer in front of you, someone who knows neural networks, how would you explain it simply or the intuition behind it?

Jesse: When we train a neural network, there’s this idea of a loss landscape. You’re trying to move down this loss landscape to find better solutions. What singular learning theory tells us is that a few points in this landscape determine the overall learning behavior. That’s similar to an insight we’ve had in physics, where it’s the geometry of the energy landscape that decides many of the physical properties we’re interested in. SLT tells us that the local geometry around these critical points (think of places that are flat, so equilibria, stable or unstable, and similar points) decides the overall learning behavior.

Michaël: Because there’s some convergence towards local optimality kind of things?

Jesse: That’s one way to think about it. In very simple situations, you can imagine you have a bowl with a ball rolling around in it. The ball will eventually settle at the bottom because there’s friction: it’s dissipating heat and losing energy. In general, points like this minimum determine the qualitative shape of the landscape.

The other important thing we observe about these singularities, hence the name singular learning theory, is that they are somehow simpler than other points in the landscape. When you’re learning, you want to come up with the simplest possible explanation for the data. This is Occam’s razor. What SLT tells us is that these singularities are somehow simpler, so by finding them, you generalize better. It explains part of the reason that neural networks work at all.

Michaël: What’s an example of a singularity inside the loss landscape?

Jesse: What’s an example? Take the classic picture. When people try to imagine what the bottom of the loss landscape looks like, a lot of them picture something like a bowl, a round paraboloid. It’s not a bowl. There are valleys: there are usually directions you can walk along that don’t change the loss. And it’s not just a single valley; it’s more like a canyon structure, with intersections between valleys. Those intersection points are the singularities that really define the structure.
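As a concrete illustration of this canyon picture, here is a tiny two-parameter loss, L(a, b) = (ab)^2, whose minimum set is the union of the two coordinate axes: two valleys of exactly zero loss intersecting at a singularity at the origin. This is an illustrative toy constructed for this transcript, not an example taken from the SLT literature.

```python
import numpy as np

# Toy loss with a singular minimum set: L(a, b) = (a * b)^2 is zero on
# the union of the two axes. Each axis is a "valley" (a direction you
# can walk along without changing the loss), and their intersection at
# the origin is the singularity.
def loss(a, b):
    return (a * b) ** 2

# Walking along either axis keeps the loss at exactly zero.
for t in np.linspace(-2.0, 2.0, 9):
    assert loss(t, 0.0) == 0.0   # valley 1: the a-axis
    assert loss(0.0, t) == 0.0   # valley 2: the b-axis

# Gradient: dL/da = 2*a*b^2, dL/db = 2*a^2*b. It vanishes everywhere
# on both valleys, and the Hessian at the origin is identically zero,
# so the minimum is degenerate rather than a round bowl.
def grad(a, b):
    return np.array([2 * a * b**2, 2 * a**2 * b])

assert np.allclose(grad(0.0, 1.5), [0.0, 0.0])
assert np.allclose(grad(1.5, 0.0), [0.0, 0.0])

# Plain gradient descent from a generic start settles somewhere on a
# flat valley floor, not at an isolated minimum.
theta = np.array([1.0, 0.8])
for _ in range(2000):
    theta -= 0.05 * grad(*theta)
print(loss(*theta))  # approaches 0 along the valley
```

Note that descent here converges to a point on one of the valleys, and which point depends on the initialization: the set of minima is a whole curve, not a single bottom of a bowl.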

Michaël: What happens between two valleys?

Jesse: If you have two valleys, another way that machine learning researchers talk about this is they talk about different basins. Different basins seem to correspond to different computational structures, different ways of implementing a function or implementing some algorithm. If they’re qualitatively different algorithms, they’re different solutions. That’s very relevant for us. We want to understand what is this neural network actually doing? Is it lying to me? Is it being deceptive? Those kinds of questions start to get at this question of how is it actually generalizing? How is it actually performing computations internally?

Does SLT actually predict anything? Phase Transitions

Michaël: I’m going to ask a very blunt question. Do we actually understand anything new using singular learning theory? Can we actually predict things that we weren’t predicting before?

Jesse: I think the most interesting predictions made by SLT are phase transitions. It tells us that we should expect there to be phase transitions, so discrete changes, sudden changes, in the kinds of computations being performed by a model over the course of training. I should give that a little caveat. A lot of this theory is built in another learning paradigm, Bayesian learning. There’s still some work to apply this to the learning paradigm we have in the case of neural networks, which is with stochastic gradient descent. It’s this ball running down the hill. What it does tell us is it predicts these phase transitions, and that would tell us something very significant about what makes neural networks different from other kinds of models.

Michaël: What are the very different things that you predict?

Jesse: We have lots of evidence for things like phase transitions in neural networks. When people talk about phase transitions in machine learning, a lot of it is kind of hand-wavy. They’re saying, oh, there’s a sudden drop in the loss, or a sudden drop in some other metric. It’s questionable, and they kind of want to borrow the physics language because it sounds nice. The thing is, they’re probably right. These probably are actual phase transitions in the physical sense of the word.

Michaël: For things like grokking?

Jesse: Like grokking. Another example would be the induction heads paper, where they find that at a certain point in training there’s this bump where models suddenly learn the following: if A is followed by B earlier in a text, then when the model sees A again later in the text, it predicts B. There are a bunch of phenomena like these that still need to be properly linked to the picture from SLT, but if it’s the case that these are phase transitions as predicted by the theory, that would be a major selling point for SLT.
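The pattern described above can be sketched as a plain lookup rule. This toy function mimics only the input-output behavior of an induction head, not the attention mechanism that actually implements it inside a transformer.

```python
# A minimal sketch of the *behavior* an induction head implements:
# after seeing "... A B ... A", predict B.
def induction_predict(tokens):
    """Find the most recent earlier occurrence of the last token and
    copy whatever followed it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # token never seen before: no prediction

seq = ["the", "cat", "sat", "on", "the"]
print(induction_predict(seq))  # -> "cat"
```

The point of the induction heads result is that this copying behavior appears suddenly during training, in a narrow window, rather than improving gradually.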

Why care about phase transition, grokking, etc

Michaël: Why do we really care about these phase transitions?

Jesse: First of all, they seem to represent discrete changes in computation. That’s concerning from a safety perspective. A lot of the scenarios we think about when we ask why AI could be dangerous, a lot of these threat models, run through some kind of discrete, sudden change. Think of a model suddenly learning how to be deceptive: it’s able to lie now. And if the change is sudden, it gives us little time to react. So we want to be able to anticipate these moments and understand what’s going on in them. That’s why phase transitions are relevant.

Michaël: Why phase transitions and not something else?

Jesse: Phase transitions, they matter. First of all, they matter from a safety perspective. When we’re thinking about the ways in which AI could be dangerous, we think about these threat models. A lot of these threat models run through a scenario in which a model goes through some sudden change. Maybe it suddenly develops a dangerous capability. It suddenly acquires a value that isn’t aligned with us.

Michaël: I think people don’t really know what a threat model is, except for people on LessWrong.

Jesse: A threat model is a concrete scenario you describe of one way that AI, or some other risk, might result in damage to people, or extinction, or whatever you want. When you do alignment research, you try to prevent or circumvent those threat models by making your AI more aligned in that particular scenario.

Detecting dangerous capabilities like deception during development, before they crystallize

Jesse: So I think one of the reasons we’re interested in interpretability is that if we can read the internals of the brains of these systems, we might be able to detect things like deception and then prevent deception. That seems very relevant. But we want to be able to detect deception as it forms. We want to know when these dangerous capabilities are first acquired, because later it might be too late: they might become stuck, crystallized, and hard to get rid of. So we want to understand how dangerous capabilities and misaligned values develop over the course of training. Phase transitions seem particularly relevant for that, because they represent the most important structural changes, the qualitative changes in the shape of these models’ internals.

Jesse: Beyond that, another reason we’re interested in phase transitions is that, in physics, phase transitions are understood to be a kind of point of contact between the microscopic world and the macroscopic world. It’s a point where you have more control over the behavior of a system than you normally do. That seems relevant from a safety engineering perspective. Why do you have more control in a physical system during phase transitions? So… this is going to get a little technical.

A concrete example: magnets

Michaël: You can go more in like, just like concrete example, like if you’re driving a car, like a very simple thing like…

Jesse: So let me give you an example, right? So take a magnet, right? If you heat a magnet to a high enough temperature, then it’s no longer a magnet. It no longer has an overall magnetization. And so if you bring another magnet to it, they won’t stick. But if you cool it down, at some point it reaches this Curie temperature. If you push it lower, then it will become magnetized. So the entire thing will all of a sudden get a direction. It’ll have a north pole and a south pole. So the thing is though, like, which direction will that north pole or south pole be? And so it turns out that you only need an infinitesimally small perturbation to that system in order to point it in a certain direction. And so that’s the kind of sensitivity you see, where the microscopic structure becomes very sensitive to tiny external perturbations.
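The magnet story can be made quantitative with the standard mean-field self-consistency equation m = tanh(m * Tc / T). The sketch below is textbook mean-field theory (with units chosen so Tc = 1), not anything specific to this conversation; it shows both effects Jesse mentions: above the Curie temperature a tiny perturbation decays to zero, and below it the same perturbation grows into a macroscopic magnetization whose direction is set by the perturbation’s sign.

```python
import math

# Mean-field Ising magnet: the equilibrium magnetization m solves the
# self-consistency equation m = tanh(m * Tc / T), where Tc is the
# Curie temperature.
def magnetization(T, Tc=1.0, seed=1e-3, iters=10_000):
    """Iterate the fixed-point equation from a tiny initial perturbation."""
    m = seed  # an infinitesimally small nudge picks the symmetry-breaking direction
    for _ in range(iters):
        m = math.tanh(m * Tc / T)
    return m

hot = magnetization(T=1.5)   # above Tc: the perturbation dies out, m -> 0
cold = magnetization(T=0.5)  # below Tc: the perturbation grows, m -> finite value
print(hot, cold)

# The sign of the tiny seed decides where the north pole ends up:
assert magnetization(T=0.5, seed=-1e-3) < 0 < cold
```

This is the sensitivity at the transition: an arbitrarily small external perturbation decides which macroscopic state the whole system falls into.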

Michaël: And so if we bring this back to neural networks, if the weights are slightly different, the overall model could be deceptive or not. Is it something similar?

Jesse: This is speculative. There are more concrete examples. So there are these toy models of superposition studied by Anthropic. And that’s a case where you can see that it’s learning some embedding and unembedding. So it’s trying to compress data. You can see that the way it compresses data involves this kind of symmetry breaking, this sensitivity, where it selects one solution at a phase transition. So that’s a very concrete example of this.


Michaël: So I hear what you’re saying about physics. It sounds very interesting. And I’m glad that you’re doing this research with your background in physics. But I think most people don’t know about interpretability. I think it’s common in ML. But for people like the layman person on YouTube, why do we care about making models interpretable? And is there any way? What are the things people actually do for making models interpretable? What does it mean?

Jesse: There are several stories you can tell about interpretability. Maybe the easiest one is something like we can detect when it’s lying to us. What are they thinking? Can we read their minds? And if that works, that’d be great because we can detect things like deception maybe. We want to know when it’s lying to us because that could be when the model is hiding that it wants to do harmful things to us. So that’s what interpretability is trying to do. And traditionally there’s this field of mechanistic interpretability which is trying to reverse engineer what’s going on inside of neural networks. Ideally you could write down a program in a programming language after you’ve done mechanistic interpretability. And all of a sudden have something you can just read and understand what’s going on inside these networks.

Why Jesse Is Bullish On Interpretability

Reason 1: we have the weights

Michaël: I was looking at my comments yesterday and one guy was commenting like, if someone is very, very smart, it will just try to hide as much as it can until it finds the right opportunity. Why are you bullish on us being able to detect a very smart agent lying to us?

Jesse: Well, the first reason to be somewhat optimistic is the fact that we have the weights. We have these models on our computers. In neuroscience, we can’t read people’s minds yet. But there’s another problem with neuroscience, which is that we don’t even know how to measure what’s going on in your brain, or at least how to measure it accurately. With neural networks, all of that information is available to us. That’s one strong thing in favor of interpretability.

Michaël: What if it’s learning online and is still being trained?

Reason 2: the Universality Hypothesis

Michaël: How would you detect something that is being trained and the weights are changing?

Jesse: Maybe we’ll get to that in a second. I just want to add one more reason to think something like interpretability might be possible, and that’s this hypothesis of universality. If you train vision models, models that are trying to predict, you know, is this a cat or a dog, or trying to draw a box around part of an image, what you find is that many different models, many different architectures, develop very similar internal representations. They have very similar circuits that look for edges, and then for certain combinations of edges that form circles. Or maybe they look for a region with a high frequency next to a region with a low frequency, which represents an edge. These kinds of structures are universal across many different models.

Jesse: So there’s this universality hypothesis that something similar is true in general: many different models, maybe trained on many different kinds of data, with many different architectures, converge to similar internal structures and similar ways of reasoning about the world. The same structures that you see in Chris Olah’s blog posts on circuits for vision networks: every vision network would have the same circuits at the end.

Michaël: Are there theorems or convergence proofs for this, or is it mostly intuition right now?

Jesse: Right now, this stuff is very empirical. But there’s some more evidence. If you look at the way humans process images, it’s also very similar to the way these models do it. You have these Gabor filters, the curve detectors.

Michaël: What filters?

Jesse: Gabor filters. These are filters that were developed long before we had neural networks. And it turns out that neural networks develop very similar kinds of filters, similar both to the ones we know humans implement and to the ones we had already engineered by hand. So those kinds of filters are a point of convergence: they end up appearing in many different neural networks.
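For intuition, a Gabor filter is just a sinusoidal grating windowed by a Gaussian envelope. The sketch below is hand-rolled with assumed parameter choices (size, wavelength, and width are illustrative, not canonical values); it checks that the filter responds much more strongly to a grating at its preferred orientation and frequency than to a featureless patch.

```python
import numpy as np

# A hand-rolled Gabor filter: a cosine grating windowed by a Gaussian.
def gabor(size=9, wavelength=4.0, theta=0.0, sigma=2.0):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)  # grating orientation
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * x_rot / wavelength)

k = gabor()

# It responds strongly to a vertical grating at its preferred frequency...
grating = np.tile(np.cos(2 * np.pi * np.arange(9) / 4.0), (9, 1))
strong = np.sum(k * grating)

# ...and only weakly to a featureless (constant) patch.
weak = np.sum(k * np.ones((9, 9)))

assert abs(strong) > 10 * abs(weak)
```

Filters of exactly this shape show up in early layers of trained vision networks and in models of the primary visual cortex, which is the convergence Jesse is pointing at.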

Michaël: They emerge from most training processes?

Jesse: Right, so this stuff is empirical. But there are theoretical reasons to think something like universality might hold. Again, this appeals to the physical picture. In physics, you have a different notion of universality: it turns out that many different systems, magnets, certain fluids, behave qualitatively very similarly when you study them close to phase transitions. They have these scaling exponents, and their behavior close to these points is identical, totally independent of the microscopic details of the structure.

Jesse: So if we expect something similar to hold for neural networks, then we can make connections there. Maybe this universality also applies in the case of learning machines.

Developmental Interpretability

Jesse: In any case, that’s what mechanistic interpretability tries to do. Now, the obvious problem with very large systems is: how do you figure out all of the things going on inside a neural network? Maybe you can find many of the big-picture things, but it’s very hard to find all of the little details. That’s a struggle for interpretability. Developmental interpretability proposes that we study how structure forms over the course of training. It may be more tractable to find out what’s going on in the neural network at the end if we understand each individual transition along the way.

Jesse: The relevant thing about developmental interpretability is this. Suppose it’s possible to largely understand what’s going on inside of neural networks; that’s the first assumption. Even then, it’s going to be very difficult, I think intractable, to do that at one specific moment in time. I think the only way you’re actually going to build up an exhaustive idea of what structure the model has internally is to look at how it forms over the course of training. You want to look at each moment where the model learns specific concepts and skills, and isolate those, because that tells you where to look for structure. So developmental interpretability is the idea that you should study how structure forms in neural networks. That might be much more tractable than trying to understand the structure at the end of training.

Michaël: So it’s the overall process of seeing these small phase transitions, and seeing these small bones appear in a body or something. And at the end, when you have the full body, you’re like, oh, I know how it formed, I know where it created a new bone or a new structure.

Jesse: This is an excellent analogy, because developmental biology is a place we should look to for input. They have experience studying very large systems that grow in similar ways. Look at cell differentiation: it involves phase transitions that you can study.

Michaël: I think the devil’s advocate point of view would say that babies come with a big brain, that’s why the wombs of women are so large. And this is because we already have everything that is required for a human to be general in this kind of brain. So maybe the development is more like something about evolution than something about our brains, our brains start somehow general.

Jesse: Did a stork drop that baby off in that woman’s uterus? No. At some point, everybody is a single cell, a single fertilized cell, and then many cell divisions later, you end up with a baby. So if you understand those divisions, maybe it’s a more tractable way to understand what you end up with.

Michaël: For biology, do you think there is a reasonable amount of time in which humanity could understand how we go from cells to a body, all the single processes in the genome that create humans as they are? If you think it’s possible with AI, do you think it’s possible with humans? Or are humans more complicated?

Jesse: I think in the case of humans it’s possible, in principle. The difficulty there is more in terms of measurement. This is the same difference as between neuroscience and AI: with AI, we have the weights on our computers, and we can follow the entire development of these systems.

Michaël: I don’t really have the GPT-4 weights on my computer.

Jesse: You don’t. Someone does. It might still run into limits of tractability; it might be very expensive to run all that compute. But you don’t have to jump through the hoops of figuring out how to even measure the thing, which is the central problem in biology.

What Happens Next? Jesse’s Vision

Michaël: I’m saying this as somebody who does not know. I’m curious what a perfect world looks like in your view. Imagine we get all these phase transitions, we see all these structures emerge, and we have this huge model, say GPT-5, and we can identify maybe hundreds of those. What happens next? Can we then tell what the thing has learned, watch GPT-5 generate a huge piece of text, and say: now we know these are the structures that explain it? How deep and how precise can we go with this approach? Is it scalable?

Jesse: Let me give you two answers. First is the prosaic, boring answer: if you can understand when dangerous capabilities form, when misaligned values form, and you understand this process sufficiently, you can prevent them and steer things in the right direction. So developmental interpretability becomes something that supports many other kinds of alignment techniques. That’s the boring answer. The more speculative answer is: suppose this really works, and you get a deep understanding of the relation between structure and function. Then you can do something like eliciting latent knowledge, ELK. You can try to extract this knowledge, these skills, from the network and implement them in another system where you understand everything that is going on. So you distill. You use that to create verifiably safe systems: you extract the knowledge from the dangerous model, which is going to be bigger, you distill, and you can build a safer model from it. I don’t think this is easy. I think it’s a problem on the order of the Manhattan Project, probably harder. But I don’t think there’s anything making it impossible in principle.

Michaël: In your first answer, the boring answer, you can just see where the deceptive behavior emerged. So maybe there are two counterpoints. One is: OK, you see the deceptive behavior, but there’s nothing you can do about it, because it’s already there. And the second, if I try to channel my inner Yudkowsky, I would say something like: all the cognitive behaviors necessary to, not play chess, but solve very complicated math, require being good at deception. So there’s no way of separating those behaviors. Either you have very dangerous stuff, or you’re pretty narrow.

Jesse: I think this really gets to the point that it’s unclear where you draw the divide between capabilities and values. Both are implemented in these models. To the extent that structures like skills and abilities emerge through phase transitions, we should expect something similar for values, for what these systems actually want. So I don’t think you end up fully interpreting these systems and their capabilities without also being able to interpret their values and what they want. You’d be able to see that this AI doesn’t value humans, or values paperclips, or… Again, this is far from where we are right now, but that’s the speculative picture.

Toy Models of Superposition

Michaël: So where are we right now?

Jesse: We are able to do the theory on the toy models of superposition, and we have lots of projects going.

Michaël: So the toy models of superposition is something you guys have worked on? Is it the thing you were talking about from Neel Nanda? What is this?

Jesse: The toy model of superposition is a very simple model. You have some vector, just synthetic data, nothing interesting. You try to compress it, and then you try to expand it back to the original. Because there’s a bottleneck, it’s actually a difficult learning task.

Michaël: It’s like an autoencoder, right?

Jesse: It’s like an autoencoder, exactly. Now, I mention this because it’s the model that’s currently best understood from the theory, from singular learning theory’s point of view.
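To make the setup concrete, here is a minimal numpy sketch of the kind of toy model Jesse is describing (following the standard toy-models-of-superposition setup: compress features through a bottleneck, expand back with the transposed weights, apply a ReLU). The dimensions and sparsity level are illustrative assumptions, not taken from the episode:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden = 5, 2  # bottleneck: 5 features squeezed into 2 dimensions

# Untrained illustrative weights; in the real experiment these are learned.
W = rng.normal(size=(n_hidden, n_features)) * 0.5
b = np.zeros(n_features)

def forward(x):
    h = W @ x                       # compress into the hidden bottleneck
    x_hat = W.T @ h + b             # expand back to feature space
    return np.maximum(x_hat, 0.0)   # ReLU

# Sparse synthetic data: each feature is zero most of the time
x = np.where(rng.random(n_features) < 0.2, rng.random(n_features), 0.0)
x_hat = forward(x)
print(x_hat.shape)  # (5,)
```

Because the hidden layer is smaller than the number of features, the model cannot represent every feature in its own direction, which is what makes this simple reconstruction task interesting to study.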

Singular Learning Theory Part 2

Michaël: I still don’t understand what singular learning theory is.

Jesse: Singular learning theory is a theory of, say, hierarchical models. Hierarchical models like neural networks, like hidden Markov models, or other kinds of models. And these hierarchical models are very special because the mapping from parameters to functions is not one-to-one. So you can have different models. If you look at the weights, they seem like very different models, but they’re actually implementing the same function. So it’s actually a very special feature of these systems, and it leads to all kinds of special properties.

Michaël: When you say systems, is it like non-neural networks? This is more general than just learning… than neural networks? Is it like any learning algorithm?

Jesse: Right, so any singular learning algorithm, and that consists of these hierarchical model classes.

Michaël: Yeah, I feel like if you said hierarchical model classes, this is so general that I don’t really see a concrete example.

Jesse: It is very general, but that doesn’t lower the strength of the results.

Michaël: Is a neural network part of those singular learning models?

Jesse: Yeah, neural networks are singular, and that’s why they work.

Michaël: Cool, so your theory applies to neural networks, so that’s good. If someone is hearing you say stuff about hierarchical classes and someone is completely lost and has no idea what you’re talking about, do you have any explanation of the properties of those things that make it interesting? Why is something singular and not singular? What is an example of something not singular?

Jesse: The fact that you could have two different systems, and you look at the weights, so you look at the strengths of the connections between neurons, and they seem totally different, but actually they have the same input-output behavior. They’re implementing the same function.

Michaël: As long as you have a set of parameters that can be mapped to a function, and you have two different sets of parameters that map to the same function, you say the overall thing is singular?

Jesse: Yeah, there are some more technical conditions. There are actually two things, but in any case, this is the important one to just keep in mind.
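The property Jesse describes, different weights implementing the same input-output function, can be seen in a tiny example. In a one-hidden-layer ReLU network you can rescale a hidden unit’s incoming weights by any c > 0 and divide its outgoing weights by c, and because ReLU is positively homogeneous the function is unchanged. This numpy sketch (dimensions chosen arbitrarily for illustration) checks that:

```python
import numpy as np

rng = np.random.default_rng(1)

# One-hidden-layer ReLU network: f(x) = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

def f(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Rescale hidden unit 0: multiply its incoming weights by c,
# divide its outgoing weights by c. relu(c*z) = c*relu(z) for c > 0,
# so the network computes exactly the same function.
c = 3.7
W1b, W2b = W1.copy(), W2.copy()
W1b[0] *= c
W2b[:, 0] /= c

x = rng.normal(size=3)
print(np.allclose(f(W1, W2, x), f(W1b, W2b, x)))  # True
```

So (W1, W2) and (W1b, W2b) look like very different weight settings but are the same model, the basic symptom of a singular model class.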

Michaël: Your current work or interest is to try to apply these kinds of properties of singular learning— sorry, what do you call them? Singular models?

Jesse: Yes, singular models.

Michaël: To developmental interpretability, and see if you can identify the structure using those properties behind stuff that maps to the same function.

Jesse: Right now, my priority is testing some of the assumptions behind developmental interpretability. Those assumptions in particular concern this question of phase transitions. Now, we have good reason to think that phase transitions exist. There are lots of different examples. But the question is, can we bridge the theoretical picture of phase transitions?

Jesse’s Early Story

Michaël: I’m curious more about your story with alignment. When did Jesse Hoogland start being concerned about AI posing an existential threat? Was it at 16, at 12, or at 20? What was the moment when you said, oh damn, I need to spend my entire life on this?

Jesse: That’s a good question. I have always been interested in neural networks. These are just fascinating things and interesting systems to study. I remember reading in 2015 Superintelligence by Nick Bostrom.

Michaël: Very good book.

Jesse: It’s a good book still. At the time I thought, oh, AGI, that’s far in the future, it’s not here yet. I think it’s when I started using Copilot more regularly that I realized, oh shit.

Michaël: Like 2021, 2022?

Jesse: Yeah, exactly. That’s when I realized, oh, actually it’s happening soon. And then all the other concerns came to the foreground, and this curiosity suddenly weaponized into concern.

Michaël: What exactly in using Copilot made you think that AI was going fast? Was it just that you were using it for work, for your studies, and you were like, oh, it’s doing most of my work? Or was it that you didn’t know how to code very well and it was able to help you code and be much more productive?

Are Current Models Creative? Reasoning?

Jesse: I mean, there’s still this meme that goes around about stochastic parrots, right? That AIs are really just repeating what they’ve seen in the data, that they don’t actually understand what’s going on, they’re just repeating. And I really started to see, engaging with these systems in 2020, 2021, that there was more going on. They were actually solving original problems, they were actually reasoning. There’s something there. And at that point I realized, okay, we’re in the endgame now.

Michaël: Where is the reasoning? If I’m like Gary Marcus or Tim Scarfe from ML Street Talk, I would say, where is the reasoning? I don’t see the reasoning.

Jesse: You pose original questions that you know are not in the data set. Like what? Niche programming problems that have never shown up before, because you are doing something weird and new. It doesn’t take long to end up in a weird corner of creativity space that has never been explored before.

Michaël: But it’s just doing copy-pasting from Stack Overflow, and it’s just weirdly mixing up those concepts. It doesn’t really understand what’s going on.

Jesse: Yeah, I see what you’re saying. So I think here, the word creativity, right? Like look at the word creativity and it says something about creation. And I think that’s actually just a terrible word for the phenomenon. Most of the time creativity is about synthesis. It is about combining things together. So people have this weird myth in their heads about creativity needing to be about creating new original content. When really most of the work we’re doing is just combining things together.

Michaël: And so you saw Copilot and you were like, oh, this shit is creative. This is going faster than I thought. So what did you do next before you came here? What was the thing that you were interested in? What was the path that led you to have those thoughts about interpretability or how to solve alignment?

Building Bridges Between Alignment And Other Disciplines

Jesse: So I always thought there’s a lot of tooling from physics to pull into studying neural networks, and that’s taken a long time for other people to realize. A lot of physicists have tried to make their mark on machine learning. It’s just a thing physicists do: they go out into the world and see, oh, here’s a new field, I’m going to try to make my mark in this field and impose my arrogant physics ideas on this subfield. And yeah, there was still a lot of room to do that. At some point I stumbled across singular learning theory, and I saw, oh, this is the natural place where physics comes into the picture to say something useful about AI and alignment.

Michaël: And when you saw those specific paths from singular learning theory, what’s the strategy? What’s the roadmap? What’s the thing that needs to happen for us to be alive in, let’s say, 50 years?

Jesse: We need to get a Manhattan Project going, I think. We need to involve all of the major scientists in the world. I want Terence Tao working on this problem. I want Kontsevich working on this problem. I want the biggest physicists to all be thinking about this. I want every smart technical person on the planet leaving college to think about AI safety for a few years.

Michaël: How do we get those people like realistically to spend some time on the problem?

Jesse: You hit the subscribe button.

Michaël: And you support me on Patreon.

Jesse: And hit the bell.

Michaël: If you’re a famous physicist and you’re not subscribed.

Jesse: Well, it’s going to take a while. I mean, we’re not there yet. We’ve got a lot of work cut out for ourselves. But it’s going to look something like creating lots of positions: research orgs, academic positions. Finding ways to involve academia and a bunch of researchers from many different disciplines who have not been involved yet. And developmental interpretability is a place where this happens. It’s a place where you can pull in developmental biologists. It’s a place you can pull in string theorists and statistical physicists. It’s a place you can pull in algebraic geometers and a bunch of other mathematicians. There are other areas like this where we want to involve many different people, not just from academia, but also from industry and other places. So it’s about building those bridges and doing the outreach.

Is Interpretability The Bridge Between Alignment And Other Disciplines?

Michaël: I agree that we need to bring more scientists to solve alignment. And I’d be very excited to have biologists or physicists working together on this. And we can build bridges by pointing out some work they can do, some academic papers that they can focus on.

Where To Learn More About The Topic

Michaël: What should I look for on the Internet to learn more about this topic?

Jesse: If you want to learn more about this technical subject of singular learning theory, developmental interpretability, you should go to singularlearningtheory.com. Another place you can go is metauni.org. They have a bunch of amazing seminars hosted in Roblox of all places.

Michaël: metauni.com?

Jesse: metauni.org, I believe. They host these online seminars, a whole online university, inside of Roblox. I can confirm we are on the correct timeline. In any case, those are the places you can find out more if you’re interested in technical work on these subjects.

Michaël: I heard there’s also a popular Less Wrong post you wrote on singular learning theory that people can read.

Jesse: Look up “Neural networks generalize because of this one weird trick”.

Michaël: So you were the one who wrote this clickbait article?

Jesse: Well, I know how it is for you. Clickbait is real.

Michaël: You just owned me, Mr. Clark Kent. You like Roblox, it seems. But you seem like the real-life version of a Roblox character, a mix between Captain America and Clark Kent. Why are you so ripped?

Michaël: What’s the purpose of all this? If you want to join the forces and join the army, this is the 4th of July. So I think we’re just going to stop here and maybe shoot some guns. If you want to join the forces, maybe we should follow him.