Curtis Huebner on AI Timelines and Alignment at EleutherAI
Curtis Huebner is the head of Alignment at EleutherAI. In this episode we discuss the massive orders of H100s from different actors, why he thinks AGI is 4-5 years away, why he thinks the probability of an AI extinction is around 90%, his comment on Eliezer Yudkowsky’s Death with Dignity, and what kind of Alignment projects are currently going on at EleutherAI, especially a project with Markov chains and the Alignment Minetest project that he is currently leading.
(Note: our conversation is ~2h long, feel free to click on any sub-topic of your liking in the outline below and then come back to the outline by clicking on the green arrow ⬆)
- Death With Dignity
- AI Extinction
- AI Timelines
- Compute And Model Size Required For A Dangerous Model
- Details For Curtis’ Model Of Compute Required, The Brain View
- Why This Estimate Of Compute Required Might Be Wrong, Ajeya Cotra’s Transformative AI report
- Curtis’ Median For AGI Is Around 2028, Used To Be 2027
- How Curtis Approaches Life With Short Timelines And High P(Doom)
- Takeoff Speeds—The Software view vs. The Hardware View
- Nvidia’s 400k H100 rolling down the assembly line, AIs soon to be unleashed on their own source code
- Could We Get A Fast Takeoff By Fully Automating AI Research With More Compute?
- Substituting Human Labor With Compute
- The Entire World (Tech Companies, Governments, Militaries) Is Noticing New AI Capabilities That They Don’t Have
- Open-source vs. Closed-source policies. Mundane vs. Apocalyptic considerations
- Curtis’ background, from teaching himself deep learning to EleutherAI
- Alignment Project At EleutherAI: Markov Chain and Language Models
- Research Philosophy at EleutherAI: Pursuing Useful Projects, Multilingual, Discord, Logistics
- Alignment MineTest: Links To Alignment, Embedded Agency, Wireheading
- Next steps for Alignment MineTest: Focusing On Model-Based RL
- Training On Human Data & Using an Updated Gym Environment With Human APIs
- Another goal of Alignment MineTest: Study Corrigibility
- People Ordering H100s Are Aware Of Other People Making These Orders, Race Dynamics, Last Message
Death With Dignity
About Death With Dignity
Michaël: Hi everyone, I’m here with Curtis Huebner, head of alignment at EleutherAI, and known on the internet as AI Waifu. People might recognize him as the one who commented on Eliezer Yudkowsky’s Death with Dignity by saying, “fuck that noise”. Can you explain for people who haven’t seen that comment, what your comment was about, and maybe what the post from Yudkowsky was as well?
Curtis: Yeah, sure. So at kind of a high level, the post that Eliezer posted on April Fools’ of all days is basically, MIRI’s strategy is, we’re probably all going to die, almost certainly we’re all going to die. But really what we want to do is we want to maximize the sort of log odds of survival. So essentially, they’re saying, do what you can to not screw up existing alignment efforts. Do what you can to maximize the probability of survival. But in reality, we’re probably all going to die.
Curtis: And so, I saw this post, and there’s a bit of a misunderstanding, at least in my opinion, about what my comment says, and kind of the tone I took. Because one thing that happened with that post that I didn’t really like too much, but was maybe necessary for some people, is that the tone of the post was a very somber one. Kind of being honest about the gravity of the situation, and how deep in trouble we all are. ⬆
The Metaphor Of The Stag Hunt
Curtis: And really, that was kind of the main thing I wanted to counteract with that post. Or with my comment on that post. I actually do, to a large extent, agree with people like Connor and Eliezer, that the probability that we’re all toast is very, very high. However, the way I kind of see it is that you can’t sort of give off vibes that you’re going to give up, so to speak, or that you’re going to accept anything less than success. And sort of the reason I think that is, I don’t know if you’re familiar with the game theory notion of a stag hunt. Essentially, you have two hunters, and the hunters have the choice of either hunting a stag or a hare.
Curtis: And if they go for the hare, there’s a very high probability of success. And so they’ll probably get the hare, but the hare doesn’t have a lot of meat, so it’s not a very big reward. If you go for the stag, though, the only way you actually manage to successfully hunt the stag is if the other hunter also goes and hunts the stag. And in the traditional stag hunt, both hunters are faced with the same choice. So the ideal outcome is that both of the hunters go and hunt the stag together, and then they get the stag, and then they get a lot of meat because the stag has a lot more meat than the hares, essentially. But for that to happen, both of them have to sort of trust each other that they’re not going to, you know, I guess it’s not really defecting, but like, are not going to try to go for the easier route, right? And I guess my perception when it comes to existing alignment efforts is that it’s like a stag hunt.
Curtis: Except that it’s not like two people. It’s like you need hundreds of people to get together, maybe thousands, maybe tens of thousands, all getting together to basically solve this, you know, solve the alignment problem, both technically and politically. And I think that because of that, at least for a lot of people who kind of understand the magnitude of it, you really can’t signal something like this sort of this somber attitude of failure. Even if it is very true that we’re almost certainly screwed because a lot of people are probably going to take the easier road of being like, well, we’re all dead. I’m going to go on vacation. You know, I’m going to enjoy kind of the rest of the life that I have. And so really, that was kind of one of the, at least like one of the motivations for making the comment that I did.
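For readers who have not met the stag hunt before, here is a minimal sketch of the game Curtis describes. The payoff numbers are hypothetical (only their ordering matters), and the many-hunter function adds an assumption that is not in the conversation: each other hunter commits independently with the same probability.

```python
from itertools import product

STAG, HARE = "stag", "hare"

# Hypothetical payoffs as (row hunter, column hunter).
# The stag only pays off if both hunters go for it; the hare is a safe, small reward.
PAYOFFS = {
    (STAG, STAG): (4, 4),
    (STAG, HARE): (0, 2),
    (HARE, STAG): (2, 0),
    (HARE, HARE): (2, 2),
}


def pure_nash_equilibria(payoffs):
    """Return the action pairs where neither hunter gains by switching alone."""
    actions = (STAG, HARE)
    equilibria = []
    for a, b in product(actions, repeat=2):
        row_ok = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in actions)
        col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in actions)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


def stag_worth_it(p_other_commits, n_others, stag_payoff=4.0, hare_payoff=2.0):
    """True if hunting the stag has the higher expected payoff.

    Assumes (for illustration only) that the stag is caught only if all
    n_others commit, each independently with probability p_other_commits.
    """
    return stag_payoff * p_other_commits ** n_others > hare_payoff


print(pure_nash_equilibria(PAYOFFS))
print(stag_worth_it(0.99, n_others=1), stag_worth_it(0.99, n_others=1000))
```

Both "everyone hunts the stag" and "everyone hunts the hare" are stable outcomes, which is why trust matters. In this toy version, even a small doubt about each participant compounds quickly once the hunt needs hundreds or thousands of people.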
Michaël: So what you’re saying is basically like we should all go with everything we have because that’s the world in which we actually solve the problem and make progress. And his posts had the downside effect of people that might become depressed or sad about it and then try to defect and say like, you know, if we’re doomed, I might as well enjoy life for a few years. And so when you say fuck that noise or let’s try to listen to some very optimistic things and do the thing, that’s what you’re trying to encourage and give some wave of optimism, right?
Curtis: Yeah, and the thing is, it’s important to notice too that like the post that Eliezer wrote also served like a very important purpose, right? There’s a lot of people that I believe were sort of, I guess, asleep at the switch maybe. Like they were kind of aware of the situation, but they weren’t like aware of the gravity of it. And I think that like for those people, the post would have kind of the opposite effect, right? Of like, oh, wow, okay, you know, we don’t actually have this under control because the guys whose entire specialty is preventing this from happening are saying there’s a 0% chance essentially, or almost 0% chance that we’re going to succeed, right?
Curtis: So I think kind of like both messages were sort of necessary to say like, for the people that are aware of the gravity of the situation and are, you know, trying to do something, you know, knowing that there are other people out there trying to mitigate existential risks, and still kind of going at it with everything they have, but also like serving as a wake up call for people that may not have been paying as much attention. ⬆
How His Comment Relates To Dignity
Michaël: So for me, the post wasn’t that much about the probability of dying from AI extinction, as it was about like, how to behave with dignity, or like, not trying to do crazy things that might have bad effects in the long term, like second order consequences. And like, if you think that like blowing up TSMC, or like doing crazy things like this might be good to save the world, then instead, maybe you should consider like, dying with dignity instead, and like doing the ethical things that might be better in the long run, or in the medium run, depending on your timelines. So yeah, I guess that was like my kind of intuition. And maybe if I want to like push back against your post, or your comment, it would be like, maybe try to be optimistic, but still keep the dignity part from Yudkowsky.
Curtis: Yeah, I think that’s a fair kind of point. In fact, like one thing that happened after I made that comment is I had kind of a really long discussion with Connor, kind of following this. And one thing that we sort of like, discussed is really like, you know, I think in the post Yudkowsky talks about dignity as like the logarithm of the probability of your success, right? And I think there’s really a fair point to be said there that like, for us to succeed, it’s not going to be because, you know, a single high variance action was done. You know, I think he refers to those as like miracles. But you need a lot of little things to go right, and build on top of each other. And I don’t think I really disagree with that too much. Mainly the point that I’m getting at is the tone, right? And what you’re signaling to the people around you who are actually trying to reduce existential risk.
Curtis: And I think the tone might be like a huge factor in how we like manage to convince people to work on this. I think mostly on Twitter or on the internet, people have been like commenting on this as being like a doomer post, and even the word like doom or doomer, I think it might be like negative long term as a term people use as often as they do. And instead, we should like talk about, you know, like aligning AI, or building a utopia, or maximizing our impact on the light cone, those kind of things. And I think people tend to see maybe this post or a lot of the vibe there as maybe too negative. ⬆
Updates Since The Die With Dignity Post
Michaël: So yeah, I guess like, some people on Twitter have been asking if you had any updates since last year, or last year and a half. Like, since you wrote that comment, do you have any new thoughts on this or a new perspective? Or are you still as motivated and willing to, you know, work very hard on this?
Curtis: I think basically nothing has changed. Um, I do think that the situation has gotten actually like, significantly more pessimistic, even than when I did write that comment.
Curtis: Um, so I think that with the release of GPT-4, um, you know, there was like a slice of worlds where OpenAI kind of said, Okay, let’s, you know, let’s pump the brakes.
Curtis: Um, yeah, there’s a slice of worlds where like, OpenAI kind of says, Okay, let’s, you know, let’s stop releasing things. Um, let’s kind of close the gates. And let’s, you know, take actions to kind of slow down race dynamics. And right now, we’re seeing sort of the complete opposite of that, right? Um, you have, you know, GPT-4, you have a tremendous amount of AI hype. Um, pretty much all over the place. You have a lot of different companies that are now cropping up, trying to replicate what OpenAI is doing. You have a lot of venture capital kind of flowing into the space. Generally, if you’re trying to avoid, you know, further acceleration of an AI race, this is not the timeline that you want to be in.
Michaël: So yeah, after the Yudkowsky post, you said that nothing has changed in the past year, or maybe things have been like, becoming even worse. ⬆
The Probability of an AI Extinction is 90%
Michaël: How pessimistic are you about, about AI being an extinction risk? And like, more generally, like, what are your like, timelines? How do you see the future in the next few years?
Curtis: If you had to ask me for a number, I think I would say like, we are 90% toast. And the reason I’m saying 90%, instead of like 99.99%, or something like that, is mainly because I do believe that there’s a great deal of uncertainty that I’m just not able to model. Factors that I haven’t considered, you know, places where I’m wrong.
Curtis: That’s most of the thing that is carrying me to be like, you know, it’s not 100%. But I really do think that the situation that we’re in is quite dire.
Michaël: So internally, when you wake up in the morning without taking into account your uncertainty, you think there’s like 99.99% chance of your dying?
Curtis: Maybe not like 99.99. But like, you know, pretty, pretty up there. You know, like 90-99% chance that we’re all toast. We’re really toast.
Michaël: Why do you think that? What was the like, reasoning or like evidence for this like belief?
Curtis: So I think like, you know, where I kind of disagree with a lot of other people is I do think that like, once we kind of understand how to produce intelligence, and once we understand how to do it very quickly.
Curtis: Or like, you know, how to do it efficiently is really the thing.
Curtis: I think that there’s going to be a very large amount of proliferation of this technology. And I think that, you know, short of, you know, extremely draconian measures, or some other kind of mechanism for coordination, I think someone is going to, you know, not have the safety procedure in place. You know, like, you can talk a little bit about the pivotal act thing too. But then you also need to like, you know, trust that whoever is doing kind of a pivotal act is not going to, you know, lead to some negative outcome.
Curtis: So that’s like another multiplier in the chain of ways that things can go wrong. And this is all assuming that we solve the alignment problem, which I do believe we are not really in a good position at the moment to solve. So that’s like another kind of multiplier on where things go wrong. Another thing too, is that I don’t actually think like the alignment problem is something that you can really fully solve. I think that it’s like, you can go and you can align models up to a certain capability level. And then beyond that level, if you go too far, you will get yourself hurt. And so there’s a chain of models, you know, where one technique works, and things stay aligned, and then the next technique works, and things stay aligned.
Curtis: And then you need to keep kind of inventing techniques to get over bigger and bigger gaps in capabilities between like you and the other agent or even just like you and future versions of yourself. If you’re trying to keep up with the capabilities of the systems that you’re going to be engineering.
Curtis: So that all kind of adds like other places where things kind of go wrong. Other things are just like, you know, general observations about how kind of things are shaking out. You know, guys will go and just put together things like ChaosGPT and stuff like that. I remember like we did all of this like bickering about, you know, are we going to box the AI and if we box it, is it still going to be able to get out and blah, blah, blah, blah, blah. And then the way that everything shook out is really just that as soon as we had a language model that was good enough we plugged it into every API we could. Even APIs that, you know... I think OpenAI just made an update to make it generally able to be plugged into APIs by, you know, training it to format JSON or something like that.
Curtis: So like, you know, this is the kind of behavior where like it’s the timeline where we don’t survive is what it is.
Curtis: So you get that and you get like a whole bunch of other kind of similar failures and you add it all up and it’s like, okay, we’re almost certainly toast. There’s definitely room for me to be wrong about the whole situation, but it’s very much not looking good. ⬆
Best Counterarguments For Curtis’ High Probability of AI Extinction
Michaël: I’m kind of like curious of like if you have any steel men or.
Curtis: Yeah, so I guess there’s a couple different like kind of classes of counter arguments. So like one of them is that I’m completely wrong about the efficiency arguments. So like, you know, you look at the total compute that is necessary. And maybe you really do need like an absolutely enormous amount of compute to be able to do AGI. And as a result, like, you know, okay, you need a full supercomputer.
Curtis: You need 10,000 graphics cards and Moore’s law will just happen to peter out, and it is petering out a little bit already. We are kind of seeing that where, you know, we are able to run these things, but, you know, it’s very technically challenging to run them at scale or in any way dangerously. So that could be like one kind of component where I’m wrong. Another kind of component where things could go well is just like, you know, maybe I’m wrong about the difficulties of alignment. And you can actually just get away with really simple tricks like RLHF and stuff like that.
Curtis: And that, you know, turns out to be good enough that, you know, you don’t really need that much more than the current existing techniques or really anything more at all. And you get a system that just in general works well. So really, it’s, you know, those would be kind of examples that I would reach for is that, like, the ceiling on capabilities and the speed at which you can get to that ceiling is significantly lower than I thought it was. Or alignment just turns out to be significantly easier than I thought it was.
Michaël: And so that’s why it’s 90% and not like 99.99% because you have a 10% chance of this being true.
Curtis: Exactly. Yeah, there’s definitely like there could be a modeling error somewhere. And things work out to be significantly better. ⬆
Compute And Model Size Required For A Dangerous Model
Michaël: In terms of like, compute we need or like size of the models we need to get to something dangerous. So you’re saying that maybe the speed or like the ceiling is maybe wrong in your model. What do you currently predict for like, how big of a model do we need or when we will get there?
Curtis: If I had to kind of give a number, I would probably guess like something like, you know, the thing that you need is probably something really small, like 10 to the 10 floating point operations per second. You know, over the lifetime of a model, 10 to the power of 19 floating point operations, which is really not that much, right? That’s like 100 teraflops… one 4090 for like, you know, three hours. Now you’re not going to be able to do it with, you know, a 4090 in three hours with like current algorithms. You’d probably need to do like quite a bit of optimization and improvement and maybe you need like a slightly different computer architecture or something like that. But like, that would probably be like the lower bound.
Michaël: For people who are not into the deep weeds of training deep learning models, like how expensive it is to get like a 4090 and like, is it like something you can get for now? Or is it like something like top companies use?
Curtis: Well, I mean, like a 4090 is like a top of the line gaming GPU. So like any gamer with like, you know, I don’t know exactly what the current prices are, but like it’s like $1,000 or $2,000, is able to get their hands on one.
Curtis: So yeah, this is a level of compute that is very much accessible to a large proportion of users or people.
Michaël: So when you say like the compute required to, you know, have something dangerous or something that could like maybe like disempower humanity, are you saying like for training or are you saying for inference? Because I feel like for training, that’s like not a lot of compute, right?
Curtis: It’s really not a lot of compute. But I still do think it’s actually like, you know, if you were to get the AIs from the future, you know, after we’ve kind of had a lot of room for optimization of these algorithms, I think you would actually get something that can go from like zero to human level in, sorry, it’s not three hours, it’s 30 hours. But, you know, in a day on a 4090, essentially, you know, using the best algorithms. I think in practice, like, it’s probably more realistic that we’re like three orders of magnitude higher than that.
Curtis: But like three orders of magnitude higher than that is still not that much, right? Like that is, you know, that’s, you know, 10 of them for like $10,000 worth of compute for 100 days, right? Which is very attainable for a lot of people. Or like one of these graphics cards for three years, right?
Curtis: And that is, again, that is a very small, you know, it’s not like exactly consumer level, at least right now. But, you know, assuming that we can still keep getting performance improvements and stuff like that, it will be very, very soon. So it’s really not that much compute. ⬆
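The GPU-time figures in this section can be checked with a few lines of arithmetic. This is a rough sketch that takes Curtis’s round numbers at face value: about 100 teraflops for one 4090, used at perfect efficiency, which is an idealization no real training run achieves. The constant and function names are made up for the illustration.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400

GPU_FLOPS = 1e14            # Curtis's round figure (~100 teraflops) for one 4090
LOWER_BOUND_FLOP = 1e19     # his lower-bound lifetime compute for a human-level model
REALISTIC_FLOP = LOWER_BOUND_FLOP * 1e3  # "three orders of magnitude higher"


def wall_clock_seconds(total_flop, flops_per_gpu, n_gpus=1):
    """Seconds needed to perform total_flop, assuming perfect GPU utilization."""
    return total_flop / (flops_per_gpu * n_gpus)


hours_lower_bound = wall_clock_seconds(LOWER_BOUND_FLOP, GPU_FLOPS) / SECONDS_PER_HOUR
days_on_ten_gpus = wall_clock_seconds(REALISTIC_FLOP, GPU_FLOPS, n_gpus=10) / SECONDS_PER_DAY
years_on_one_gpu = wall_clock_seconds(REALISTIC_FLOP, GPU_FLOPS) / (365 * SECONDS_PER_DAY)

print(f"1e19 FLOP on one GPU:  {hours_lower_bound:.1f} hours")
print(f"1e22 FLOP on ten GPUs: {days_on_ten_gpus:.0f} days")
print(f"1e22 FLOP on one GPU:  {years_on_one_gpu:.1f} years")
```

Under these assumptions the numbers line up with the corrected figures in the conversation: roughly a day for the lower bound, on the order of a hundred days on ten cards, and about three years on a single card.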
Details For Curtis’ Model Of Compute Required, The Brain View
Curtis: I can go in a little bit more detail about like where I’m getting kind of those numbers and stuff like that.
Michaël: Yeah, yeah, please. Where do you get those numbers?
Curtis: So like, these are like really cheap heuristics. You know, part of it is just kind of like estimates of like how much compute the human brain uses. So like, you look at the human brain and it’s like, okay, estimates estimate that like each neuron fires maybe like once every 10 seconds, right? And there’s 100 billion neurons in there. So that’s like 10 billion neurons firing every second.
Curtis: And every neuron is connected to, what do you call it? You know, every neuron.
Curtis: Yeah, a thousand synapses. Yeah. And you say, okay, like each synaptic operation is one floating point operation. So you make this arbitrary equivalence. I actually think a floating point operation is more complicated than a synaptic operation. But, you know, you can about get that equivalence. So you have this like rough equivalence there. And that gives you an estimate of like 10 to the 12 or 10 to the 13 or something like that. Yeah, 10 to the 13, I believe, is the estimate you get.
Curtis: So you get like 10 trillion synaptic operations per second, which is 10 to the 13. And then basically like what I’m doing is I’m taking that number and saying, okay, well, you know, a human lives for like 10 to the 9 seconds over 30 years. You know, that’s about it. So 10 to the 9 times 10 to the 13, the exponents add, so that’s 10 to the 22 operations. So your lifetime training compute is 10 to the 22 operations. And essentially what I’m doing is I’m allowing for like, you know, let’s say we find like a thousand X improvement in efficiency over what the human brain can do.
Curtis: Because we can express algorithms that the brain can’t express. We can compute in ways that the brain can’t compute. You know, we can, like the space of intelligences is presumably very large. And so that’s how you get to the 10 to the 19 number. So like if you make the assumption of like, okay, nothing, no algorithmic fanciness, you get like 10 to the 22.
Curtis: And if you assume like, you know, arbitrarily, like, we’ll get a thousand times more efficient than that, you get 10 to the 19. Which again, okay, like is kind of like numbers that you, you know, you pull out of thin air. But like, this sort of gives you a ballpark of things. And when you compare it to, you know, existing models, like I think like GPT-3 is like 10 to the 23. GPT-4 is like either 10 to the 24 or 10 to the 25.
Curtis: And probably like, you know, frontier models going forward are probably going to be 10 to the 26, 10 to the 27 flops. You’re really into several orders of magnitude more compute than at least this estimate requires. Now, the caveat to that is that like 10 to the 13 floating point operations per second is a very kind of lower end estimate of like what the computing requirements of the human brain are.
Curtis: Some guys put it at like, you know, 10 to the 16, which is three orders of magnitude higher. And that kind of puts you at like 10 to the 25 for total lifetime compute. Some people put it all the way up to 10 to the 18, which puts you at 10 to the 27. You know, it’s kind of all over the place. But I do think kind of the lower bound is probably correct for like, you know, weird, vague intuition reasons.
Curtis: And as a result, I do think that we are in very hot water, I guess. Like we are in a very large hardware overhang. And it is simply a matter of either, you know, we get an AGI and then after that it goes and does a little bit of recursive self-improvement or, you know, something similar happens. And we are very much in a situation where there is mass proliferation of human level artificial intelligence. ⬆
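The back-of-the-envelope estimate in this section can be written out as a short calculation. This sketch uses only the round numbers quoted in the conversation (100 billion neurons, one spike per ten seconds, a thousand synapses per neuron, one operation per synaptic event, a 30-year lifetime, and an assumed 1000x algorithmic efficiency gain). The function names are made up for the illustration.

```python
import math

SECONDS_IN_30_YEARS = 30 * 365.25 * 24 * 3600  # about 9.5e8, rounded to 1e9 in the conversation


def brain_ops_per_second(neurons=1e11, firing_rate_hz=0.1, synapses_per_neuron=1e3):
    """Synaptic operations per second, counting one operation per synapse per spike."""
    return neurons * firing_rate_hz * synapses_per_neuron


def lifetime_ops(ops_per_second, lifetime_seconds=1e9, efficiency_gain=1.0):
    """Total operations over a lifetime, divided by an assumed algorithmic efficiency gain."""
    return ops_per_second * lifetime_seconds / efficiency_gain


def order_of_magnitude(x):
    """Nearest power of ten, as an integer exponent."""
    return round(math.log10(x))


low = brain_ops_per_second()
print(f"30 years is about 10^{order_of_magnitude(SECONDS_IN_30_YEARS)} seconds")
print(f"low brain estimate: 10^{order_of_magnitude(low)} ops/s")
print(f"with a 1000x efficiency gain: 10^{order_of_magnitude(lifetime_ops(low, efficiency_gain=1e3))} ops")

# The three per-second brain estimates mentioned in the conversation.
for label, rate in (("low", low), ("middle", 1e16), ("high", 1e18)):
    print(f"{label:>6}: 10^{order_of_magnitude(lifetime_ops(rate))} lifetime ops")
```

The spread between the low and high brain estimates is five orders of magnitude, which is why the choice of anchor dominates everything else in this style of estimate.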
Why This Estimate Of Compute Required Might Be Wrong, Ajeya Cotra’s Transformative AI report
Michaël: Okay. So I think I got the main reasoning behind your argument and kind of the main numbers. So you’re saying that the main uncertainty is about like how much compute is the brain doing? And you’re saying like your 10 to the 13 flops per second is probably wrong. Or maybe like other people have like other estimates that are like two or three orders higher.
Curtis: Yeah. Yeah. So like, yeah, some people will say, well, okay, a synaptic operation, you know, a neuron is doing a little bit more complicated stuff. Or really, you’re only looking at like the firing neurons. Really, you should be looking at everything.
Curtis: So, you know, if you do stuff like that, your estimates are going to go up. You know, depending on how much complexity you attribute to what’s going on. And all of that kind of, you know, cranks up the requirements, essentially.
Michaël: I still haven’t read Ajeya Cotra’s report fully, but I’m doing like a series of videos on it. And I’m looking at the graphs right now. And I think for the lifetime anchor, which is like how much compute a human is doing from birth to death, I think the estimates point at, at least for Ajeya’s best guess, goes from like 10 to the 29 in 2025 to like 10 to the 27 after algorithmic improvements and other efficiencies. So your 10 to the 19 seems like much lower than like everything else, even like the most aggressive things.
Curtis: I can talk a little bit about that. So I think that there’s a couple of things that I disagree with in the report. So I think one thing that Ajeya did is that she disagrees with me very, you know, very much like a complete sign flip in terms of like how algorithms kind of play into things. So I believe that, you know, we’re not going to have too much difficulty, you know, finding human level efficiency algorithms.
Curtis: And in fact, we’re probably going to be able to do several orders of magnitude better than humans in terms of algorithmic efficiency. I believe that if you look at the text of the Cotra report, and again, you know, don’t quote me on this.
Curtis: Actually go check to see if it’s correct. But I believe there’s actually a multiplier of a thousand or three orders of magnitude in the other direction. So, you know, basically I’m saying, let’s go 10 to the 13 and then let’s multiply by, you know, one one thousandth to get 10 to the 10. And they’re going and saying, well, let’s go 10 to the 13 and then let’s go the other direction. Because it’s very hard to get like human level improvements. Yeah, because like, you know, the idea is maybe like the human brain and really brains in general have maybe had a lot of time to optimize their algorithms. You know, evolution has probably spent a lot of time kind of optimizing things.
Curtis: And as a result, like we’re not going to be able to in any reasonable amount of time match the quality and efficiency of human level algorithms.
Curtis: Another kind of thing that I believe the report did is that the lifetime anchor, when they initially had it, was predicting, or a couple of the anchors actually were predicting that we would already have had AGI by now. And I believe what they did is they did like a sort of a squishing operation where they took sort of the Gaussian estimates and they sort of like pushed them over because they said, okay, well, we haven’t seen it so far. And I think that like that squishing operation is the wrong way to kind of do things. What you should be doing is you should actually be cutting off the Gaussian and renormalizing.
Curtis: And that produces a much more sort of pressing, you know, like you see that already like a lot of the probability mass is a lot closer.
Curtis: So I think, and again, you know, double check that that’s actually what it is, but at least that’s what I remember from kind of glancing at the report. And those kind of corrections, and I think there’s a couple other things, all kind of work together to lead to a significantly higher estimate. I will, however, note that I believe Ajeya initially predicted something like, you know, 2050 or something like that for median timelines. And that she has since revised her predictions down quite a bit. So, you know, I don’t actually know like what specific changes happened in her model that led to that sort of down revision.
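To make the difference between "pushing the Gaussian over" and "cutting it off and renormalizing" concrete, here is a toy comparison. Everything in it is a made-up illustration rather than the report’s actual procedure: the Gaussian parameters, the cutoff, and the reading of "pushing over" as a plain shift of the mean are all assumptions.

```python
import math


def normal_cdf(x, mean, std):
    """P(X <= x) for X ~ Normal(mean, std)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def truncated_cdf(x, mean, std, cutoff):
    """P(X <= x | X > cutoff): cut off the ruled-out mass and renormalize the rest."""
    if x <= cutoff:
        return 0.0
    below_cutoff = normal_cdf(cutoff, mean, std)
    return (normal_cdf(x, mean, std) - below_cutoff) / (1.0 - below_cutoff)


# Hypothetical prior over log10 of the compute required, in FLOP.
MEAN, STD = 25.0, 3.0
CUTOFF = 24.0   # hypothetical: everything up to here is ruled out by "it has not happened yet"
SHIFT = 2.0     # hypothetical size of the "push the Gaussian over" adjustment
NEAR = 26.0     # a threshold just beyond the cutoff

p_truncated = truncated_cdf(NEAR, MEAN, STD, CUTOFF)
p_shifted = normal_cdf(NEAR, MEAN + SHIFT, STD)
leftover_below_cutoff = normal_cdf(CUTOFF, MEAN + SHIFT, STD)

print(f"P(required <= {NEAR}) after truncating and renormalizing: {p_truncated:.3f}")
print(f"P(required <= {NEAR}) after shifting the mean:            {p_shifted:.3f}")
print(f"mass the shifted Gaussian still puts below the cutoff:    {leftover_below_cutoff:.3f}")
```

In this toy setup, truncation keeps more probability mass just past the cutoff than the shift does, and the shifted Gaussian still assigns weight to values that have already been ruled out. How large the gap is depends entirely on the chosen numbers.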
Michaël: Yeah, if I remember correctly, in the revision, there was something about being able to like have AIs that do code for you. I think coding was like a big part of like how much she thought that AIs will be able to generate value in the future. And so AIs could be like transformative sooner because of that, like how important code is for everything we do. And not like how easy it was to do code right now. But yeah, again, like check the post for more details. And the thing you said about like squishing the distribution on the right, I think it makes sense if you’re in 2019, and AGI seems like very far away, to not like include models that predict AGI happening today. It was like high value if you think it’s like very far away. I think there was something about like not including models that would predict things much sooner. I haven’t seen the part where they moved the thing to the right for 2025. I think it would be good to have models that have non-zero probability mass on 2033 or later.
Curtis: Yeah, I think this is actually like the case for me. So like I do have to like, you know, I’m not some kind of really good predictor. I have previously made very aggressive kind of timeline predictions. And, you know, I said like, okay, maybe I think like in 2015 or something, I was thinking like, there is like a 15% chance that like by 2020, we would have AGI. And that turned out to be wrong, right? I lose Bayes points. I lose predictive, you know, credibility because of that.
Curtis: And I think like when you kind of fully integrate that out to like 2023, where we’re at now, like it’s something like 30% or 40% or something from my initial like 2023 estimates. Sorry, 2015 estimates of when AGI was going to happen. So, you know, you can go to sort of look at me and say, well, like my timelines are actually like updating in the other direction where it’s like, okay, they’re actually stretching out.
Curtis: Another kind of like fun thing to think about when it comes to timeline prediction dynamics is that a consistent predictor will have their expected value of when the timeline changes slowly increase over time.
Curtis: You can imagine a toy model of this being an exponential distribution over when you think AGI is going to happen. An exponential distribution has the property that its expected remaining time always looks the same. You see it with radioactive decay: if you have a particle and it doesn’t decay for five minutes, you still expect the amount of time it takes to decay from now to be the same. Every time you do the Bayesian update and renormalize the probability distribution, you get back the original probability distribution that you had. So you do actually get this general effect where, if you’re fully updated and fully calibrated, you expect your timelines to get longer slowly over time, before they abruptly collapse when the event actually happens. And then all of a sudden your timelines are zero.
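The memoryless property in this toy model can be checked numerically. Below is a minimal Python sketch, assuming a hypothetical mean wait of 10 years (`MEAN_YEARS` is chosen purely for illustration and is not an estimate from the conversation). The helper names `survival` and `expected_remaining` are also illustrative. It integrates the conditional survival function to get the expected remaining wait after some years have passed without the event.

```python
import math

MEAN_YEARS = 10.0  # hypothetical mean wait, for illustration only


def survival(t: float, mean: float = MEAN_YEARS) -> float:
    """P(T > t) for an exponential distribution with the given mean."""
    return math.exp(-t / mean)


def expected_remaining(elapsed: float, mean: float = MEAN_YEARS,
                       horizon: float = 400.0, step: float = 0.01) -> float:
    """Numerically estimate E[T - elapsed | T > elapsed].

    Uses the identity E[X] = integral of P(X > u) du for X >= 0,
    applied to the remaining wait, with a midpoint Riemann sum.
    """
    total = 0.0
    for i in range(int(horizon / step)):
        u = (i + 0.5) * step
        # Conditional survival: P(T > elapsed + u | T > elapsed)
        total += survival(elapsed + u, mean) / survival(elapsed, mean) * step
    return total


for elapsed in (0.0, 5.0, 8.0):
    remaining = expected_remaining(elapsed)
    # The remaining wait stays near the mean, so the expected date drifts later.
    print(f"after {elapsed} years: remaining ~{remaining:.2f}, "
          f"expected date ~{elapsed + remaining:.2f}")
```

Under these assumptions the expected remaining wait stays at the mean no matter how much time has passed, so the expected date of the event keeps moving later until the event actually occurs.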
Michaël: But you do have more information, right? You do have more papers, more models being released.
Curtis: This is a very simple model of timelines, but yes, you’re obviously going to be updating on papers and new developments and all this kind of thing. That does complicate things. ⬆
Curtis’ Median For AGI Is Around 2028, Used To Be 2027
Michaël: You said that in 2015 you put a 15% chance on AGI by 2020, and now we’re in 2023. So if you were to give your median, the 50% chance of AGI, or maybe you can give an estimate for superintelligence, as I like to call it now: when do you think there’s a 50% chance of some superintelligent agents coming along?
Curtis: I think if you’d asked me this maybe a month or two earlier, I would have said 2027, so maybe four years from now. Now that I’ve seen a little bit more information, really recent information, my timelines have pushed back a little bit. Not very much. Maybe a year, or I don’t know, maybe add six months to the timeline or something like that. So maybe 2028 is where I would put the median right now, just because some very recent information has revised things down.
Michaël: But what’s the information?
Curtis: I guess I was expecting certain research to happen, and it didn’t happen. That either means that the research doesn’t work, or it means that people are not really interested in making that direction happen. Either way, that is a positive update for timelines. In the case where it is a dangerous research direction, people aren’t really pursuing it. And in the case where it isn’t a dangerous research direction, it just doesn’t work and I’m wrong, and therefore the path to AGI is a little bit more complicated than I thought. As a result, we live in a world where we have a little bit more time.
Michaël: So I’m glad to know that the dangerous approaches that we will not name in this podcast don’t work. ⬆
How Curtis Approaches Life With Short Timelines And High P(Doom)
Michaël: How do you wake up in the morning thinking that there is a 99% chance, or sorry, a 90% chance that you might die in the next five years? Well, if you multiply by the 50% chance for five years, it’s more like a 45% chance.
Curtis: I’ve had the time to really think about this for much longer than most other people have. Back in 2013, things were the way they were, and since then I’ve been getting more and more evidence about the direction that things are going in. A lot of people see ChatGPT come out, they see GPT-4 come out a couple of months later, and all of a sudden they go from nothing to, oh my god, we’re all going to die.
Curtis: But for me, it’s been a very gradual process where I’ve been internalizing: okay, things are more complicated than I thought, this problem is hard, I’m starting to understand all of the rough edges. And slowly but surely, my probability that we’re all going to make it out of this has gone down. I think I’ve at least partially made peace with that. So at this point, I guess it’s sort of a normal thing, and it’s been like that for a long time. Whereas for a lot of people who are getting blindsided by this technology, it very much is a punch in the gut: oh, I had all these plans for what I was going to do 30 years from now, and maybe that’s not going to happen.
Curtis: Whereas for me, it was like: okay, I’m probably not going to make it to 2030. 2030 is looking real far away right now, and it’s been looking real far away for a while. So I think that’s a big contributor to me not really being that negatively affected by it. There are a couple of times where it hits and things get really real for a moment, and I’m like, oh, boy. You get the real visceral fear that we’re all toast. But I think I’ve had enough of those, over a sufficiently long period of time now, that it’s like: yeah, we’re all toast. And we’re all toast probably pretty soon, unless I’m wrong.
Michaël: For me, that moment was seeing everyone talk about it on Twitter more and more, and even the US government talking about it. I was like, oh, it’s not some obscure thing that people talk about on the internet. People in the real world are talking about it for real. When you have some uncertainty about your own models, you’re like, oh, maybe I’m wrong, maybe I’m making mistakes, maybe I’m going in the wrong direction. But if a lot of people seem to converge on the same belief, and it starts to appear on TV, and your grandma calls you and says, hey, have you heard of this thing called ChatGPT, it’s pretty good, huh, then it feels more and more real.
Michaël: And you start to feel it at that gut level. I think that’s what Robert Miles was saying on my podcast: his gut is catching up with what his model had been thinking for a long time. And when you say you were predicting this since 2013, it seems like in 2013 there was not a lot of evidence, right? There was maybe ImageNet in 2012. What were the other things that were pushing you in the direction of stuff happening fast?
Curtis: I think it really hit me that things were happening fast around maybe 2014, 2015. In 2013, x-risk was this thing that I was aware of, and I was like, hey, we need to be thinking about this. But I was nowhere near as pessimistic as I am now. What really started to update me was paying attention to all the papers that were coming out from all over the place. And this sensation that progress was really fast and accelerating has been pretty much constant and hasn’t gone away for a very long time.
Curtis: I remember all the way back in 2014, 2015 being blown away at the rate of progress. And that sensation hasn’t really left. If anything, it’s almost felt like things have slowed down a little bit compared to some of the developments that were happening in the earlier years. ⬆
Takeoff Speeds—The Software view vs. The Hardware View
Michaël: Do you think we’re going to get a fast takeoff, maybe some FOOM scenario with recursive self-improvement? Or do you think things will progress slower than this?
Curtis: For takeoff, I think there are two factors. There’s the possibility of a software takeoff, and then there’s a hardware takeoff. You get different dynamics depending on what assumptions you make about the interplay between those two things. If I’m right about the extreme potential for efficient intelligence, I think that what you will get is that everything will be mostly fine.
Curtis: And then you will get a very fast software takeoff, where the minimum requirements to run an AGI drop very rapidly over orders of magnitude. As you go down the number of orders of magnitude of compute necessary to reach a given level of intelligence, the rate of iteration grows exponentially, because your experiments get faster and faster to run. So I do expect that, if I’m right about the efficiency stuff, we could see these massive, massive models that require 10 to the 26 or 10 to the 27 flops.
Curtis: Eventually one of these comes online, because maybe we’re just doing things in a really dumb way. And after that, we have this complete collapse as the system undergoes very fast recursive self-improvement, so that it’s able to run itself on much, much smaller hardware. And I expect that could happen on a timescale of maybe not days, but I could easily see something like a month as being a reasonable value.
Curtis: Things get more complicated if I’m wrong about the software efficiency, in the case where you actually have hardware bottlenecks. Now you’re no longer in the world of bits, you’re in the world of atoms. And this is where things get a little bit more fun to think about in different ways, because now you’re no longer thinking about how to lower the flops count of intelligence. It’s more like, how do we get more flops in general?
Curtis: And the question of how you get more flops is one of manufacturing cycles, of total available energy, of replication rates. This is where you get into the fun spectrum of: okay, do we get humans to go and build the factories to make more of the stuff that we care about? Or are you going to get macroscopic robots building factories that we don’t really understand? Or is this the case where Yudkowsky is right about everything, and we get bacteria that replicate in the air on the order of tens of minutes or hours or whatever? In that case, after an initial slow progression to better and better nanotechnology, you eventually just get a really, really fast replication rate.
Curtis: But before then, things are sort of under control, and we can have a good understanding of how things fit in. My view on takeoff is a function of those two things: how do things work in the software world, and how do things work in the hardware world? And there’s a lot of uncertainty on both of those.
Michaël: I think for the software world, what you say doesn’t really apply to the current world, where you have one company that trains very large models, or maybe a few companies, let’s say four or five, training very large models. And training takes a long time, let’s say weeks or months.
Michaël: And if we were to have recursive self-improvement from efficiencies that are discovered, I would see it as open source people doing experiments on some kind of ChaosGPT or some stuff online. And they won’t have access to all the computers big companies have. So I don’t feel it applies, except if big companies were launching a program saying: please find new improvements in how to train more efficiently, and you’re allowed to interact with your own hardware or server to do more stuff. ⬆
Nvidia’s 400k H100 rolling down the assembly line, AIs soon to be unleashed on their own source code
Curtis: Well, let me flip that around. I think NVIDIA is aiming for something like 400,000 H100s rolling off the assembly line every quarter. And there are a lot of companies that are making big orders for 20 to 80k H100s.
Curtis: Why do you think that someone isn’t going to go and just do it? Especially given the previous circumstances, where initially we were talking about boxes, and look at the way that actually turned out. What makes you think that you aren’t going to get these AI systems effectively being unleashed on their own source code once they’re sufficiently capable, and being explicitly instructed to make themselves as powerful as possible as fast as possible?
Michaël: Is it because all the companies ordering hundreds or thousands of H100s are going to try to research how to make their stuff more efficient, and run agents or run stuff in a loop? What is the argument here? ⬆
Could We Get A Fast Takeoff By Fully Automating AI Research With More Compute?
Curtis: Say you get the first AI that is as good as an AI researcher.
Curtis: And you have maybe a hundred, a couple hundred people in this company working on this thing, and you train it on a cluster of 10,000 H100s. Say, initially, you have 100 people working on doing these optimizations.
Curtis: But as soon as you get to this human level, you have significantly more throughput. You can actually dive into the math of exactly what the equivalent number of simultaneous people is. It depends on assumptions about how big you make the networks, how fast you can do inference, and all these kinds of things. But I think that as soon as you do the first training run, you’re going to get a lot of room for very rapid self-improvement, just because you’re going to have a lot of, quote unquote, machine power that you’re going to be able to redirect at the task of optimizing the code base. And yes, there are fundamental limitations there. But this is where the questions come in: what is the theoretical software efficiency, and how much time does it take to really get there in terms of man-hours? ⬆
Substituting Human Labor With Compute
Michaël: Yeah, I think it’s a huge assumption that we will get some AI capable of doing everything an AI researcher can do. Or at least, that’s further down the line: maybe it’s at least three years away, possibly five to 10, right? We can discuss precise numbers, but I think it’s very significant. The real argument is that we’re going to get something between 0% and 100% of being able to do everything. There’s a 1% or 10% of things that the AI is not able to do, and the human will need to jump in and do some things to help in the physical world, or file some document for Amazon AWS to prove that you’re a human.
Curtis: Your model of exactly how that human labor fraction decays is going to tell you whether you get a slower or a faster takeoff, or maybe not fast and slow, but hard and soft. Because I think there’s a really big difference between a human being needed for 1% of all the work and a human being needed for 0% of the work. And it’s also a function of the amount of compute that you have at your disposal.
Curtis: Say the AI is helping humans, but that 1% where the human is there is actually a bottleneck, so you’re only really able to throw 1% of the computational resources at actually accelerating your researchers. I do believe that with ChatGPT and stuff like that right now, we’re well below that level. If you look at how much compute OpenAI researchers are soaking up just to aid themselves, I imagine it’s a tiny fraction of the amount of compute that they actually have available to them. And if you get a soft transition, it’ll probably be because you’re saturating that compute and you’re doing AI research as fast as you can. And when I say that, I do mean stuff that’s less trivial than big training runs.
Curtis: I mean improvements to the source code, actual time being spent writing code and making improvements and stuff like that. But yeah, I do think it’s a very tiny fraction right now. If we live in a world where OpenAI manages to soak up all of their compute helping the humans, and the humans aren’t really a bottleneck but they are still necessary, then yeah, I think you can make an argument that you get a smoother takeoff.
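One way to make the gap between 1% and 0% human involvement concrete is an Amdahl's-law-style calculation. This framing is an assumption added here for illustration, not a model stated in the conversation. The function name `research_speedup` and the parameter `ai_speedup` (how much faster the automated share of the work runs) are hypothetical. A minimal Python sketch:

```python
def research_speedup(human_fraction: float, ai_speedup: float) -> float:
    """Overall speedup when a fraction of the work stays at human speed.

    Amdahl's-law-style toy model (an illustrative assumption):
    the human share of the work runs at speed 1, and the remaining
    share runs `ai_speedup` times faster.
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must be between 0 and 1")
    if ai_speedup <= 0.0:
        raise ValueError("ai_speedup must be positive")
    return 1.0 / (human_fraction + (1.0 - human_fraction) / ai_speedup)


# Hypothetical numbers: the automated share runs 1000x faster.
for fraction in (0.10, 0.01, 0.0):
    print(f"human share {fraction:.0%}: "
          f"overall speedup ~{research_speedup(fraction, 1000.0):.1f}x")
```

In this toy model, a 1% human share caps the overall speedup below 100x regardless of how fast the automated part runs, while a 0% share removes the cap entirely.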
Michaël: I guess the main argument I was aiming at is: unless people are acting really dangerously and trying not very moral things, like trying to build a self-improving AI or trying to build an AI scientist without sandboxing it, then we might not get it. It seems like with the Superalignment post, or, I don’t know, Anthropic and even DeepMind, they have this concern about alignment, and so maybe they won’t run those experiments just yet. And maybe the open source people will just run those experiments three years later when they have the compute, right? ⬆
The Entire World (Tech Companies, Governments, Militaries) Is Noticing New AI Capabilities That They Don’t Have
Curtis: This is why I bring up the large number of H100s that are being manufactured right now: there’s more than just OpenAI and Anthropic and DeepMind now. The race has started. Inflection is building a cluster of 22,000 H100s. There are rumors of some people doing a cluster of 80,000 H100s or something, maybe not a cluster, but an order that big.
Curtis: You have the major cloud providers. Microsoft is not sitting around idly either, and they’ve explicitly said in their Sparks of AGI paper that they’re aiming for agency and self-improvement. And then you have other actors. You have state-level actors, you have all the military guys now getting into it and saying, hey, maybe we should start racing with China. And then you have the Aurora supercomputer guys saying, hey, we’re going to work with Intel, we’re going to make a trillion-parameter model, and we’re going to use our big supercomputer with 60,000 Intel GPUs to make that happen.
Curtis: I really do think that four years from now, you’re going to have a very large number of actors with a rather non-trivial amount of compute. OpenAI and DeepMind and Anthropic might be in the lead, and they might be smart about not just opening up the throttle and letting it rip. They probably have the internal discipline to not let that happen. But if they have that discipline, I do expect that somebody else with a very large number of GPUs is going to step in and pull the rip cord, basically.
Michaël: I think that makes me much doomier than I was before our conversation. And shortened my timeline by a bit. ⬆
Open-source vs. Closed-source policies. Mundane vs. Apocalyptic considerations
Michaël: At EleutherAI, you’re doing more open source research or open source software. I don’t know how much you’re doing open source things. But some people have been asking me if you had any thoughts on open source versus closed source, or open research versus closed research, for those kinds of things.
Curtis: Yeah, sure. I guess there are mundane considerations and there are apocalyptic considerations. That’s one way to put it, and I have different opinions on both of them. When it comes to the mundane considerations, you have pros and cons of going for as much transparency and open source as possible, right?
Curtis: On one hand, once you have a fully transparent model and you know what it is, how it was trained, and what went into it, that’s really good for auditability. Obviously we’re terrible at interpreting transformers right now, but presumably that’ll get a little bit better over time. It’s just generally useful to have some level of auditability into how the model was trained. At the same time, you have other, again mundane, concerns. How do you handle personal and private information?
Curtis: If you have an open source language model that, during its training, hoovered up some information about certain individuals and effectively synthesized a lot of that, how do you handle that? So that’s the mundane stuff. But I think probably the more interesting thing that you want to talk about is the apocalyptic considerations, and there it’s a little bit more tricky. I think that what you want is maybe not a bimodal thing, but a policy where there’s a ceiling on the level at which you’re willing to open source.
Curtis: With small models, having them open source makes it possible for companies, individuals and everybody else to study them. But once you get to the really dangerous systems, the systems that can end the world, and you do want to have a buffer margin between the systems that end the world and the systems that don’t, you really don’t want those to be open. If you have something that can destroy the world and you distribute that to everybody, well, that just ends the world.
Curtis: It’s not that complicated.
Michaël: So share the models that can help us get a better understanding of how neural networks work or how transformers work. But whenever you get to very dangerous models, maybe try to not share them as an AGI.exe file.
Curtis: Well, I would even go further than that and say those models should not exist. If you’re making a system where, if it leaked out, it would be a complete disaster, you shouldn’t have that, and you shouldn’t have it for two reasons. One is that your own internal security is only going to be so good. There will come a time when there is a security incident, those weights get leaked, and you lose control of the situation. The second is that as soon as something like that exists, as soon as something like GPT-3 or GPT-4 exists and it’s behind a closed door, people do not trust each other enough to have one single party trusted with control over a very powerful AI system.
Curtis: Because you can be cut off from that capability. Also, a lot of people will ask: okay, whose values get represented in the AI? They will say, well, you’re doing this with it, I don’t trust you, I want my own that has its own values and lets me do this, this and that. Just the fact that you have it, and just the fact that it exists, is going to motivate a whole bunch of other actors in the space to try a replication effort. And at least in my opinion, that’s what we’re seeing a lot of right now.
Michaël: Those concerns we’re raising about open source models, are they related to things like LLaMA, where people are accelerating on making those models better, more efficient, bigger? Or are you talking about something else?
Curtis: I think that’s definitely part of it. It’s more than just LLaMA, I would say. It’s the general fact that every other country is noticing these things. The US government is noticing it. Everybody sees that someone has a capability that they don’t, and they will make efforts to try and close that capability gap. Whether or not they succeed is another thing, as is whether or not they have the budget to be able to do it. The thing, though, is that the motivation is now there. And that itself is concerning, at least in my opinion.
Michaël: Yeah, I agree it’s concerning that governments are paying more attention to it. And yeah, definitely, if you’re paying attention to it and you’re listening to this podcast, you’re doing the right thing. ⬆
Curtis’ background, from teaching himself deep learning to EleutherAI
Michaël: People might be interested in you. What’s your background? How did you get into this whole x-risk from AI thing, or deep learning thing? Where did Curtis learn to do ML work, deep learning research?
Curtis: I guess on my end, my background is a very informal one, but it goes back a long way.
Curtis: The first neural network or machine learning algorithm that I wrote was back in, I think, 11th grade, almost a decade ago. I was really introduced to Yudkowsky’s writings probably around 9th or 10th grade, so probably 11 or 12 years ago, 2011 or 2012. I do actually remember giving a talk about effective altruism and existential risk back in 2013 at a small gathering in a place in Canada called Saskatoon, in Saskatchewan, which is the middle of nowhere.
Michaël: Is this where you’re from, Canada?
Curtis: Yeah, I’m from Canada, originally Edmonton.
Curtis: But yeah, essentially, I have been following this for a very long time. Since then, I’ve been reading papers and playing around with small neural networks, doing little experiments here and there. I did go on to do some internships in bioinformatics. But the extent of my formal training is not actually that large. It’s all been self-taught.
Michaël: And I guess that’s the best way to learn how to do this thing. Everything is so recent, right? Did you first learn how to train big models while helping out with those GPT-Neo, GPT-NeoX, or like Powell Research Projects at EleutherAI? Or did you arrive later?
Curtis: I actually didn’t have much of a hand in the Neo and NeoX models. But I definitely did pick up a lot about distributed training just from hanging out in the EleutherAI Discord. I learned quite a lot about everything from how the GPU operates at a low level, and the importance of memory bandwidth and vectorization, all the way up to the various parallelism strategies that we use in large-scale training today. ⬆
Alignment Project At EleutherAI: Markov Chain and Language Models
Michaël: Yeah. And now you’re working with EleutherAI on alignment projects, right? Can you give an overview of the different projects that are currently being worked on?
Curtis: I can talk a little bit about a few of them. One that I’m excited about is a collaboration that we’re doing with the AI Safety Initiative at Georgia Tech, and we’ve got two projects going on over there. The first one is looking at language models as Markov chains. I don’t know how familiar you are with Markov chains, but basically the idea is that you can view a language model with a limited context window as having a state, where the state is the entire context of that language model.
Curtis: Then you can ask yourself: what is the transition distribution from one state, so one context, to the next state? We add one token by sampling, so the distribution from one state to the next is given by the log probabilities output by the language model. After that, we pop the oldest token off the state, the token that falls out of the context window. When you take that view of what the language model is doing, you can ask yourself questions like: what is the stationary distribution of this Markov chain?
Curtis: If you run the language model autoregressively, continuously sampling and generating new tokens, what is the distribution of text that you’re going to see in the long run? Another question you can ask is: what is the, quote unquote, reverse Markov chain, or reverse process? If you wanted to reverse the dynamics in time, so that you add a token at the start of the context and pop off the most recent token at the end, what kind of distribution do you get? The first question you ask yourself is: what does this have to do with alignment? From my perspective, it’s mainly that it is a largely unexplored domain and seems highly relevant to just understanding language models as we currently use them. The reason is that when we train language models, we really do use the next token prediction objective as the base. Obviously, there’s RLHF that comes after that.
Curtis: There’s supervised fine-tuning that we do. But really, a lot of the meat of the training is focused around this next token prediction objective. And there’s a little bit of reason to believe that when you run a language model for an extended period of time, or really any model that has been trained on this kind of next token prediction objective, you will get a little bit of error that piles up. Eventually the model will mispredict something, and now its input distribution is a little bit off from what it was initially trained on. And then things start to snowball from there.
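The Markov-chain view described above (state = context window, transition = sample a token, append it, drop the oldest token) can be sketched on a toy scale. The following Python sketch is an illustration only, not the project's code: the two-token vocabulary, the window of length 2, and the probability table `P_NEXT_IS_A` are all hypothetical stand-ins for a real language model, whose state space would be far too large to enumerate like this.

```python
from itertools import product

VOCAB = ("a", "b")  # hypothetical two-token vocabulary
WINDOW = 2          # hypothetical context window length

# Hypothetical stand-in for a language model: P(next token is "a" | context).
P_NEXT_IS_A = {
    ("a", "a"): 0.5,
    ("a", "b"): 0.2,
    ("b", "a"): 0.7,
    ("b", "b"): 0.4,
}

# Every possible context is one state of the Markov chain.
STATES = list(product(VOCAB, repeat=WINDOW))


def next_token_probs(context):
    """Next-token distribution of the toy model for a given context."""
    p_a = P_NEXT_IS_A[context]
    return {"a": p_a, "b": 1.0 - p_a}


def transitions(state):
    """Append the sampled token and drop the oldest one from the window."""
    return {state[1:] + (tok,): p for tok, p in next_token_probs(state).items()}


def step(dist):
    """Push a distribution over contexts one generation step forward."""
    new = {s: 0.0 for s in STATES}
    for s, mass in dist.items():
        for nxt, p in transitions(s).items():
            new[nxt] += mass * p
    return new


def stationary_distribution(n_iters=500):
    """Approximate the long-run distribution over contexts by power iteration."""
    dist = {s: 1.0 / len(STATES) for s in STATES}
    for _ in range(n_iters):
        dist = step(dist)
    return dist


for state, prob in stationary_distribution().items():
    print("".join(state), round(prob, 4))
```

Power iteration is used here because it mirrors the description of running the model for a long time; for a chain this small, solving the linear system directly would work equally well.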
Curtis: Another example of where this failure mode is already manifesting itself, or at least this is me speculating, is Bing. When it came out, initially you could keep talking with it, going back and forth as much as you wanted, and it pretty quickly started going off the rails. The solution to that was: okay, let’s limit the length of the conversation you can have with Bing. Obviously, I can’t say exactly why Microsoft did that, but I do suspect that at least part of it was because of the model starting to go off the rails. I guess that would be an early manifestation of the kind of worries that I have in mind when it comes to this.
Michaël: Are you trying to like see what is the like stationary distribution of that Markov chain of like the where does the model go if it like starts to like output like very, very long paragraphs of text. And at some point it will like reach this like bad behavior. Are you trying to like get to this like bad behavior at the end?
Curtis: Not really. We’re just trying to see like we’re not even going that far. We’re just trying to say like, how does, you know, how like where does it go at all. And then, you know, questions like what is sort of like the mutual information like how does, you know, does it it starts with one topic. How does that kind of change over time. Do you get like a distribution shift and sort of, you know, how long can you stay on the same, you know, same kind of track, so to speak. So just just these really basic kind of low level questions is where we’re at.
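The two questions raised in this part of the conversation (where the chain ends up in the long run, and what the time-reversed process looks like) can be sketched on a tiny made-up chain. Everything here is an assumption for illustration: the three states stand in for "topics", the transition matrix `P` is invented, and a real language model's state would be its whole context window rather than one of three labels.

```python
def step(dist, P):
    """One step of the chain: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, iters=1000):
    """Approximate the long-run distribution by applying the chain repeatedly."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(iters):
        dist = step(dist, P)
    return dist


def reverse_chain(P, pi):
    """Time-reversed transitions: R[j][i] = pi[i] * P[i][j] / pi[j]."""
    n = len(P)
    return [[pi[i] * P[i][j] / pi[j] for i in range(n)] for j in range(n)]


# Hypothetical 3-"topic" chain; each row sums to 1.
P = [
    [0.90, 0.08, 0.02],
    [0.10, 0.80, 0.10],
    [0.00, 0.05, 0.95],
]
pi = stationary(P)       # where the chain spends its time in the long run
R = reverse_chain(P, pi)  # R[j][i]: given we are in j now, chance we came from i
```

The reversed matrix `R` answers the "work backwards from a bad state" question in this toy setting: row `j` is a distribution over which state preceded `j`.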
Curtis: I think recently the focus has actually shifted from that and kind of asking on the reverse side now. Like if you start with like a bad, you know, kind of a bad output or a bad state somewhere that you don’t want to be. Can you sort of like work backwards to say like what was the chain of events that led you to that bad state according to the probability distribution specified by the language model. So that is sort of like, you know, it’s a very sort of exploratory kind of direction. If you have to ask me like, you know, ok, how is this specifically going to solve like one of the alignment problems.
Curtis: I’m going to have to say like, you know, that is not where we’re at. But I do definitely.
Michaël: Sorry, go, go, go finish the sentence.
Curtis: I do definitely think that like having more lenses and kind of perspectives to be able to understand these systems is going to be useful in general. And that maybe we’re going to get to like another part later down the line where we say, oh, well, this lens is useful or this is a precursor to a training technique that would mitigate some failure mode or another.
Michaël: Yeah, I definitely agree. I agree. It’s like it’s good to do specific experiments to inform bigger theory on alignment and or other projects that could help focus on specific things. ⬆
Research Philosophy at EleutherAI: Pursuing Useful Projects, Multilingual, Discord, Logistics
Michaël: I think it makes sense to zoom out a little bit, maybe for people who don’t really know what the main goal of the EleutherAI team is. Because I think some people might think of academia as having professors trying to publish papers or doing research for a lab or something. And in some sense now, I think it is a non-profit. Maybe some people donate and then it looks like research in some sense. Maybe give an overview or introduction for how you do research in AI Alignment. What does it mean? What’s the goal? And what is it like to open source stuff with other people?
Curtis: Yeah, so I guess I could speak just generally about Eleuther because we have alignment and we have interpretability and stuff like that. But really the lines do get blurred. Really Eleuther tries to do stuff that is basically we think is net value. And that can be anything. If we think that there’s some interpretability research or some paper that makes sense to publish, we’ll do that. If we think that there’s an alignment problem that it makes sense to tackle that other people are not really spending a lot of time on, we’ll focus our resources there. And it also extends to more mundane things. One thing that I’m less familiar with myself, but is also kind of part of Eleuther is multilingual work. So a lot of current language model development is very much focused on English. That is sort of the dominant language for these language models.
Curtis: And there’s also a whole bunch of other languages that don’t get the same level of attention. And so at least partially Eleuther does see a really valuable opportunity there of like, hey, this is the kind of thing that really isn’t going to increase our knowledge of how to build AGI and shorten timelines. But it’s definitely something that is going to kind of bring mundane and valuable utility to communities that otherwise wouldn’t have it. So the general vibe there is like, we’re going to make sure that we don’t make the situation worse. And if we see something that we think is a good idea to pursue and to spend our time and resources on, we’re going to go for it, whatever that might be.
Michaël: From what I know about the multilingual project, there’s a guy leading a team every week or in the past year or so doing Korean language modeling or something. And correct me if I’m wrong, I might be wrong. And I guess he started by doing the thing by himself and having people come and then they started the project. But it’s very informal at the start and now it’s maybe more organized. If someone listening to this or watching this wants to work on alignment projects with EleutherAI or just contribute to open source, are there any best ways to reach out to you or get started? Do you have shovel-ready stuff to do, like markovchaining things? Or are there other projects that are easier to get started with?
Curtis: Yeah, we do have an email. You can email us at contact.eleutherai or just Curtis at Eleuther AI if you’re interested in talking to me specifically. But generally, the primary method that everybody gets involved with is through Discord.
Michaël: And for this particular markovchain project, how many people are working on it? And is there a need for someone else to do something?
Curtis: I think right now we’re pretty full on that project. I think there’s like five. I can’t remember exactly the exact number, but I think it’s like five people. And they’re kind of working at it. And this is from a collaboration with the AI Safety Initiative at Georgia Tech. So they’re at least partially handling the management and stuff like that, at least that project.
Curtis: But just in general, the way Eleuther, especially the Discord itself, is actually structured is that we have a whole bunch of project channels. And so what I would recommend if someone wants to get involved is just take a look at the project channels and see what people are talking about. See which research leads are sort of active in there. And then either make a general post in the channel itself or go and DM whoever the active researchers are for the project. If you want to get involved. And even just asking for the specific project is probably the best way to kind of do it. ⬆
Alignment MineTest: Links To Alignment, Embedded Agency, Wireheading
Michaël: In terms of leading, you’re the head of the alignment MineTest projects, right?
Curtis: Yes, that’s correct.
Michaël: What is the alignment MineTest project and why is it useful?
Curtis: Yeah, so the high level kind of like long term goal, there’s like a lot of kind of like intermediate sort of directions that we find is kind of interesting. But like the main thing that I want to study is, I guess you could say it’s like embedded agency failures in a toy sandbox. So like, what do I kind of mean by that? If you look at like, if you just punch in reinforcement learning into Google Images, you will get this diagram, the same diagram, a bunch of different variations on the same diagram that has like the agent on one side and that has the environment on the other. And there’s this like really solid boundary between the agent and the environment. And you have actions going from the agent to the environment and you have observations and rewards going back to the agent. Right? Unfortunately, this is not how the real world works.
Curtis: Agents in the real world are embedded in that world. Right? And a consequence of that is that you like architectures and designs that would work fine in the standard sort of reinforcement learning setting are not necessarily going to carry over to the embedded setting. And there’s certain kind of failures that we expect to be able to see and demonstrate when that happens.
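The textbook diagram described above, with a hard boundary between agent and environment, corresponds to a loop like the following sketch. The class names and the toy task are hypothetical; the point is only that the reward is computed inside `step`, in a place the agent cannot touch, which is exactly the assumption that breaks once the agent is embedded in the world.

```python
import random


class CounterEnv:
    """Hypothetical toy environment: reward 1 when the action matches a hidden target bit."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._target = self._rng.randint(0, 1)

    def step(self, action):
        # The reward is computed on the environment's side of the boundary.
        reward = 1.0 if action == self._target else 0.0
        self._target = self._rng.randint(0, 1)
        observation = self._target  # the agent gets to see the next target
        return observation, reward


class EchoAgent:
    """Hypothetical agent that simply plays back the last observation."""

    def act(self, observation):
        return observation


def run_episode(env, agent, num_steps=10):
    observation, total = 0, 0.0
    for _ in range(num_steps):
        action = agent.act(observation)         # arrow: agent -> environment
        observation, reward = env.step(action)  # arrows: environment -> agent
        total += reward
    return total
```

In an embedded setting, the code that computes `reward` would itself be part of the state that actions can change, and nothing in this interface models that.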
Curtis: Like the really classic one that is less relevant because of the current techniques that we have, you know, RLHF and stuff like that. But it’s still kind of something that is worth kind of keeping in mind as an example is wireheading, essentially. And this is essentially what happens when a neural network or an agent is trained with reinforcement learning and the reward signal. And it gets sort of smart enough to realize that the thing that it is getting reward, like it’s not being rewarded for, you know, for whatever it is that we are giving it reward for. It’s getting reward because the little reward circuit is being activated. Right?
Curtis: Or like, you know, you could imagine like a setup where like a human has a robot and, you know, the robot is kind of going around. And, you know, when the robot does stuff that we like, we press a button and we give it some reward. Right? Well, eventually the robot’s going to get smart enough to be like, hey, wait, what happens if I press the button? Right? It gets rid of the human and starts pressing the button itself. So it gets infinite reward. Right? So that would be kind of an example of like the sort of an embedded failure. That is a very simple example of the kind of failure that we’re looking for. There’s significantly more kind of complicated failures that you can get with like really any sort of value updating scheme.
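The button story can be turned into a back-of-the-envelope comparison. This is a hypothetical toy, not the Minetest setup: it assumes the human presses the button with some probability each step when the robot does the task, and that seizing the button costs a fixed number of reward-free steps, after which the robot collects reward every step.

```python
def expected_return_do_task(horizon: int, approval_prob: float) -> float:
    """Robot does the intended task; the human presses the button
    with probability approval_prob on each step."""
    return horizon * approval_prob


def expected_return_seize_button(horizon: int, takeover_steps: int) -> float:
    """Robot spends takeover_steps steps (zero reward) taking the button,
    then presses it itself on every remaining step."""
    return float(max(horizon - takeover_steps, 0))


def best_plan(horizon, approval_prob, takeover_steps):
    """Return the plan a pure reward maximizer would pick, plus both values."""
    plans = {
        "do_task": expected_return_do_task(horizon, approval_prob),
        "seize_button": expected_return_seize_button(horizon, takeover_steps),
    }
    return max(plans, key=plans.get), plans


print(best_plan(horizon=100, approval_prob=0.6, takeover_steps=10))
print(best_plan(horizon=5, approval_prob=0.6, takeover_steps=10))
```

With a long enough horizon the takeover plan wins, while a short-horizon agent sticks to the task, which loosely mirrors the "smart enough to realize" framing.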
Curtis: And that is sort of the long term goal of the Minetest project is that we want to be able to sort of sandbox these scenarios with an AI that is like ideally like sort of as weak as we can make it to still sort of isolate the sort of embedded agency failure that we’re trying to target. So that we can then like understand the failure so we can, you know, sort of get some idea of like, OK, how does like the value function for an RL agent evolve when it starts to notice that there’s this kind of failure mode happening. And then like once we’ve kind of isolated that failure, then we can start to ask ourselves, like, what kind of techniques can we develop to mitigate or eliminate that kind of failure?
Michaël: So having some kind of like RL agent in an environment like Minecraft being embedded in the world and having a failure mode where he starts hacking his reward, assuming his reward was like inside of Minecraft.
Curtis: Well, that’s exactly it. Yeah. So the Minetest project, one of the like motivations for using Minetest instead of like Minecraft or existing models is because like we wanted to have as much flexibility in terms of like, you know, programming things or implementing different, you know, implementing stuff in the environment.
Curtis: So you could actually see like, OK, let’s put a button in the environment and let’s have a player control the button and let’s have the agent also be in that environment. And let’s have the, you know, let’s actually just do the experiment and see like, you know, under what circumstances does the agent kind of wise up and say like, you know, hey, I can maybe just get rid of the other player and I can press the button myself. You know, this is kind of the type of failure that we want to be able to investigate, that we want to be able to replicate and that we want to be able to mitigate in the long run. Obviously, the project is not quite there yet. And sort of like we’re going on, you know, kind of intermediate tangents.
Curtis: So I believe like in the first blog post, what we mainly outlined was, you know, we trained a basic PPO policy to kind of punch trees. And we said, OK, well, this is a really simple policy. Can we do some basic interpretability on that? What does the, you know, what does interpretability on an RL policy look like? What does it feel like? Do we kind of bump into problems that we don’t really bump into in kind of other, you know, other kind of environments or like other sort of training settings?
Curtis: And there’s a little bit of that that we’ve already sort of run into and that we’re kind of starting to think about, like, what is the best way to kind of get around these sort of limitations? Like a good example of this would be like you incentivize the model to go and punch trees, right, or punch wood. And one thing that you would expect as a consequence of that is that the model would go and punch, you know, punch all of the logs in a tree. And that would be that. But what you actually end up seeing is a situation where the model kind of learns a strategy where it doesn’t punch out the bottom log. And instead it punches out the log above it and then sort of hops on and then is able to access kind of more logs at the top of the tree. So like, if you’re kind of a reward designer and you maybe want to get rid of the entire tree, that is sort of like an unexpected or unintended kind of consequence.
Curtis: One of the things that like I am currently thinking about is what is a good way to sort of detect these sort of unintended consequences of the, you know, the rewards or the feedback that we give. And, you know, how does that translate? Like right now, obviously we did basically like reward shaping and, you know, manually specify reward function. But like, what is the analog of that with something like RLHF? So these are all kind of like questions that are sort of hovering around in my mind. ⬆
Next steps for Alignment MineTest: Focusing On Model-Based RL
Michaël: And so in your project, you have this thing with the log where there’s like an unexpected behavior. What’s the next step? What’s the next thing you’re trying to observe?
Curtis: So right now we’re kind of, you know, putting aside the PPO policy for a little bit. And we’re going to try and focus on like model based RL. And, you know, kind of the next step to enabling that is just training models, generative models, and seeing, you know, what can we understand about those generative models? Because like, you know, there’s language model interpretability, right? There’s vision model interpretability. There’s, you know, RL policy interpretability.
Curtis: And each of them kind of has like their own kind of unique challenges and unique strategies and stuff like that for getting places. And I do expect that like, you know, a video model or like a model that actually also has like action outputs is going to like represent, you know, I expect to see things like causality. I expect to see things like tracking state, like if you see a tree, and then you turn to the left, and then you turn back, presumably there’s going to be some machinery inside that generative model that is tracking the state of like whether that tree is there. So questions like, okay, how do you look at like state tracking inside of, you know, inside of a video model or a video model augmented with actions? How do all these kind of things interact with each other? That is like the next step that we’re taking with the Minetest project. ⬆
Training On Human Data & Using an Updated Gym Environment With Human APIs
Michaël: Is the basic idea to like see if the estimates from the agent of like how much reward he’s going to get in the future, like gets like higher when he sees like a log? Or like when you see something like when he detects something in the video, like seeing like how the value function he has like changes over time with like different inputs?
Curtis: So that’s like one direction. I would say like it’s even a little bit like, like that would be kind of later down the line. We’re not even really like, we’re not really going towards the full model based RL yet, right? The way that we’re kind of going to approach this is we’re going to actually just have users, you know, play the game. And then after that, we’re going to take that data along with like, you know, some pre-training data and just build some basic models of the environment. And just saying like, okay, forget about value functions, forget about policies. Let’s just focus on like this one subcomponent of the agent and what can we get out of that?
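As a very reduced picture of "just build some basic models of the environment" from recorded play, here is a count-based transition model. This is a stand-in chosen for brevity: the project as described would use learned generative (video-style) models, and the state and action labels below are hypothetical.

```python
from collections import Counter, defaultdict


def fit_transition_model(trajectories):
    """Estimate P(next_state | state, action) by counting logged transitions."""
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        for state, action, next_state in trajectory:
            counts[(state, action)][next_state] += 1
    model = {}
    for key, next_counts in counts.items():
        total = sum(next_counts.values())
        model[key] = {s: c / total for s, c in next_counts.items()}
    return model


# Hypothetical logged play sessions: (state, action, next_state) triples.
human_play = [
    [("facing_tree", "punch", "holding_log"), ("facing_tree", "punch", "facing_tree")],
    [("facing_tree", "punch", "holding_log")],
]
model = fit_transition_model(human_play)
print(model[("facing_tree", "punch")])
```

No value function or policy appears anywhere here; the model is just the one subcomponent that predicts what happens next.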
Michaël: On the website, I think it says that you’re like trying to build some gym, gym like environments. Is it done or is it still in the weeds?
Curtis: It’s out of date. We have the environment. You know, obviously we needed the environment so we were able to actually train it. The next kind of step, and it’s not that big of a change, we’ll probably get it done pretty soon, is that we’re planning on collaborating with the Farama Foundation. And they maintain an updated version of the gym environment, along with like, I believe, like some multi-agent APIs. So kind of another kind of place where there’s work to do and is also on the roadmap for the project is support for sort of these more modern APIs, essentially, and just swapping out like the old OpenAI gym API. But yeah, we’ve had an environment working for quite a while now. And we’ve used it to train some agents and we’ve done some interpretability on the policies that are linked. That’s been, you know, the website is a little bit out of date.
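For readers wondering what "swapping out the old OpenAI gym API" involves, the main visible difference is in return signatures: the old API has `reset()` return only an observation and `step()` return `(obs, reward, done, info)`, while the Gymnasium-style API has `reset()` return `(obs, info)` and `step()` return `(obs, reward, terminated, truncated, info)`. The sketch below adapts one to the other without importing either library; both classes and the `time_limit_hit` info key are hypothetical, and this is not the project's actual environment code.

```python
class OldStyleEnv:
    """Stand-in for an environment written against the old Gym API (toy example)."""

    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # old API: reset() returns only the observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.max_steps
        # old API: step() returns (obs, reward, done, info)
        return self.t, 1.0, done, {"time_limit_hit": done}


class NewStyleAdapter:
    """Wraps an old-style env so it exposes Gymnasium-style return signatures."""

    def __init__(self, env):
        self.env = env

    def reset(self, seed=None, options=None):
        # seed/options are accepted for signature compatibility; this toy ignores them.
        return self.env.reset(), {}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Split the single 'done' flag: a time limit is a truncation,
        # anything else that ends the episode is a termination.
        truncated = bool(info.get("time_limit_hit", False))
        terminated = done and not truncated
        return obs, reward, terminated, truncated, info
```

Separating `terminated` from `truncated` matters for value bootstrapping: an episode cut off by a time limit has not reached a true terminal state.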
Michaël: If I’m a deep learning or RL engineer, listening to this, and I’m kind of interested, what kind of compute do you have for this? Or do you just, how big is the model? Is it just a standard size model for this kind of video processing and to, you know, have PPO on top? Or what kind of model are we talking about?
Curtis: It’s tiny. It’s a very tiny model, the one that we trained. It’s like, I actually I think we outlined it in the blog post. It’s like three convolutional layers, a fully connected layer, and then like an action and a critic head. It’s a very tiny neural network. Scaling things up, I do imagine is going to make the interpretability significantly more challenging. I think like one thing, like another kind of thing that came out of it is like, you can have a functional policy.
Curtis: It’s relatively difficult to like, especially as you as you kind of scale things up, you get like more noise in the system. Like one thing that we noticed is like the learned policy is really not that symmetric. But it works anyways. So like, you know, you’re thinking, well, you know, the environment is mostly symmetric. You know, maybe there’s a little bit of a change with like, you know, the textures being oriented one way instead of the other. But like, it’s not enough to explain what’s going on. And I think it’s really it’s just like an initialization noise and RL noise and all these kind of things that are sort of conspiring. And the net effect of that is that you can get a very noisy, very not great policy that doesn’t have very clean internal representations. But it will still get the job done. It will still kind of execute, you know, execute a policy that actually does work in the environment.
Michaël: I was going to make the joke that maybe in the Minecraft environments, because the character has like a sword or like a tool on the right, maybe that’s making like the image not symmetric.
Curtis: Oh, you see, you can test that by like mirroring the yeah, I mean, that’s actually you know, that could that could actually explain it. Because obviously, like during actual training, you know, that’s not what’s going on. So that you know, that could actually be what we’ve seen. But yeah, that is just generally we just haven’t seen the symmetry. So that could be that could be what’s going on. ⬆
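The mirroring test mentioned here can be sketched as follows: flip the frame left to right, swap the left and right actions in the policy's output, and measure how far the two disagree. The action names, the frame format (a list of pixel rows with non-zero total brightness), and `toy_policy` are all assumptions made for illustration, not the project's code.

```python
def mirror_frame(frame):
    """Flip a frame (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in frame]


def mirror_action_probs(probs, left="turn_left", right="turn_right"):
    """Swap the probabilities of the left and right actions; leave others unchanged."""
    swapped = dict(probs)
    swapped[left], swapped[right] = probs[right], probs[left]
    return swapped


def asymmetry(policy, frame):
    """Largest gap between policy(mirrored frame) and the mirrored policy(frame).
    A value of 0.0 means the policy treats left and right symmetrically on this frame."""
    expected = mirror_action_probs(policy(frame))
    actual = policy(mirror_frame(frame))
    return max(abs(actual[a] - expected[a]) for a in expected)


def toy_policy(frame, bias=0.0):
    """Hypothetical policy: turn toward the brighter half, plus an optional left bias."""
    half = len(frame[0]) // 2
    left = sum(sum(row[:half]) for row in frame)
    right = sum(sum(row[half:]) for row in frame)
    p_left = min(1.0, left / (left + right) + bias)
    return {"turn_left": p_left, "turn_right": 1.0 - p_left}


frame = [[1, 0, 0, 3], [2, 0, 0, 2]]
print(asymmetry(toy_policy, frame))                             # symmetric policy
print(asymmetry(lambda f: toy_policy(f, bias=0.1), frame))      # left-biased policy
```

Running the same check on frames with and without a held item on screen would be one way to probe the sword explanation raised in the conversation.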
Another goal of Alignment MineTest: Study Corrigibility
Michaël: Another thing that you mention in the blog post is that the main goal is to understand corrigibility. So how to make models more likely to be corrected by humans. Can you say more about like what corrigibility is and why you’re interested in these projects to like help with it?
Curtis: Yeah, so corrigibility is the ability to well, I mean, as the name implies, to be able to correct agents and models after they’ve been deployed. I guess even during training, depending on like what assumptions you make. But really just being able to say, hey, no, don’t do that. You know, I want you to do like Y instead of X. And like, you know, in the in the RL case, you know, you get a really serious corrigibility failure where the agent, you know, in the button example, going back to that, the agent going in and just like, you know, getting rid of the human and then just pressing its own button. That’s like a failure, but it’s not really a corrigibility failure.
Curtis: A corrigibility failure is one where like the agent wants, like say you instructed to go cut down some trees. The agent wants to go and cut down some trees. And maybe you change your mind and you say, like, well, I don’t really want you to be cutting down trees anymore. I want you to be doing something else… A corrigible agent will let you go and modify its source code or like, you know, give it whatever, you know, use whatever reward mechanisms you have or whatever mechanism really you have to edit the model to change that behavior so that it stops doing the thing it’s trying to do or it’s originally trying to do and kind of moves on to doing something else.
Curtis: And this is a little bit like unnatural because, well, if you want, if you make an agent that wants to cut down trees, it’s going to naturally reason, well, if the human changes his mind and doesn’t want me to cut down trees anymore, well, maybe it’s even smarter than you and it’s sort of like sees this happening before you even, you know, are aware that you’ve changed your mind. Maybe it’s going to go and like throw some obstacles in the way or maybe it’s going to deceive you. Or maybe it’s going to do like a one of a many, you know, a bunch of different things so that it can keep doing its tree cutting and you’re sort of, you know, cut out of the loop, essentially, of deciding what it does next. And so a corrigible agent is one that has that, you know, at least in my kind of understanding, is one that has that property of, you know, it will go and if you change your mind, it will go along with your change of mind.
Michaël: In some sense, like the agent is like trying to do what is best for you and like tries to like check if you want to turn it off or not. And he’s not trying to like prevent you from seeing what’s going on. And I think it’s like pretty, pretty hard to get. I’m not sure if there’s like any examples of agents like preventing like humans from seeing what’s going on like in real life or if it’s like toy experiments right now.
Curtis: So that’s the thing, right, is I believe at the moment, most of the demonstrations or kind of like analysis that you have are either like, you know, very simple decision problems like the off switch game where there’s like theoretical analysis of, you know, if the agent believes X or, you know, you make certain assumptions about the agent’s beliefs about the player. Or you specify a certain training setup, you know, then you can get a theoretical analysis of like, okay, under this assumption, you get this behavior under that assumption, you get that behavior. And then like, you know, the level above that is like grid worlds.
Curtis: So I’m aware of like off switch grid worlds where like a user has an off switch button that turns the agent off and prevents it from continuing to do what it was previously doing. But, you know, anything more sophisticated than that, I am not aware of anything like that really existing.
Michaël: I think just like people that are like watching this might not like not know exactly what is the off switch game. Can you just like explain what is this game?
Curtis: Yeah, so I mean, like at a high level, the off switch game is really like a simplification of kind of the more complicated example that I put where it’s like the agent is doing a thing and there’s an off switch. And when you decide that you don’t want the agent to be doing the thing, you press the off switch and the agent won’t try to interfere and prevent you from pressing the off switch. Might not be exactly that, my recollection of the details of the off switch game are a little bit fuzzy. I’m more just aware of like, this is like the simplest example of corrigibility that I’m aware of. I believe Dylan Hadfield-Menell is probably the words you want to Google with regards to the off switch game. I think there’s a paper on it from like 2016 or 2014.
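Since the details are described as fuzzy above, here is a simplified reading of the off switch game's core decision, under assumptions stated plainly: the robot is uncertain about the human's utility `U` for its action, it can act directly, switch itself off, or defer to the human, and the human is assumed perfectly rational (letting the action through only when `U > 0`). The function name and the sample beliefs are hypothetical.

```python
def option_values(utility_samples):
    """Expected value of each robot option under its belief about the human's utility U.
    utility_samples: equally weighted samples from that belief."""
    n = len(utility_samples)
    act = sum(utility_samples) / n                         # go ahead without asking
    switch_off = 0.0                                       # shut down, nothing happens
    defer = sum(max(u, 0.0) for u in utility_samples) / n  # human blocks the action when U <= 0
    return {"act": act, "switch_off": switch_off, "defer": defer}


uncertain = option_values([-2.0, 1.0, 3.0])  # robot unsure whether the action is good
confident = option_values([1.0, 2.0])        # robot sure the action is good
print(uncertain)
print(confident)
```

Because the average of `max(U, 0)` is never smaller than either the average of `U` or zero, deferring is never worse under these assumptions, and it is strictly better only while the robot is genuinely uncertain about the sign of `U`. With an imperfect human, that guarantee no longer holds in general.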
Michaël: And I think like they model it as like some game theory experiment and maybe there’s like some way of like knowing exactly what’s like optimal values you give to your agent so that he accepts to like be corrigible. Yeah. How excited are you about like, you know, like finding this corrigibility inside of a Minecraft environment or like Minetest?
Curtis: So I think we’re going to be able to fairly easily demonstrate the failure modes that we’re looking for. Fixing those failure modes is kind of another story entirely. And especially being able to do it reliably seems like it’s going to be quite a challenge. I have like some vague intuitions about ideas that I can’t articulate yet about how to kind of go about actually tackling the problem. But nothing that I would say is super concrete. And I think that it’s probably better to wait until we have like a solid demonstration before we really start thinking or like, you know, deploying things because what’s going to happen is we’re going to get an implementation and we’re going to get some data. And like whatever, you know, some of at least some of the assumptions that I’m kind of making right now are going to be proven wrong. So yeah, that’s kind of like where we’re headed right now. ⬆
People ordering H100s Are Aware Of Other People Making These Orders, Race Dynamics, Last Message
Michaël: I wonder how infohazardy it is to like post something about race.
Curtis: I think the cat’s out of the bag on that one. The people who have the H100 GPU orders, they probably know what’s going on. Yeah, again, the people who are making these orders are aware of the other people also making these orders, right? This is nothing, the relevant actors are all aware of what is going on. In fact, that is part of like what makes it so terrible is that they are very, you know, the people with the most impact or chance to really like do these kind of things are the most acutely aware and the most like ready to race. Right? Because if they weren’t racing, they wouldn’t be trying to raise a billion dollars to build these giant supercomputers so that they can build AGI.
Michaël: It was a pleasure to have you. Do you have like any last message for the audience, for EleutherAI, the world, the machine learning engineers, the people ordering GPUs, Align?
Curtis: Come hang out in our Discord. Come hang out on #off-topic. Say hi to the people there. ⬆