Ethan Caballero On Why Scale is All You Need
Ethan is known on Twitter as the edgiest person at MILA. We discuss all the gossip around scaling large language models in what will later be known as the Edward Snowden moment of Deep Learning. In his free time, Ethan is a Master’s degree student at MILA in Montreal, and has published papers on out-of-distribution generalization and robustness, accepted as oral and spotlight presentations at ICML and NeurIPS. Ethan has recently been thinking about scaling laws, both as an organizer and speaker for the 1st Neural Scaling Laws Workshop.
Contents
- Introduction
- Scaling Laws T-Shirts
- Scaling Laws, Upstream and Downstream tasks
- Defining Alignment and AGI
- AI Timelines
- Recent Progress: AlphaCode, Math Scaling
- The Chinchilla Scaling Law
- Limits of Scaling: Data
- Code Generation
- Youtube Scaling, Contrastive Learning
- Scaling Exponent for Different Modalities
- AGI Race: the Best Funding Model for Supercomputers
- Private Research at Google and OpenAI
- Why Ethan did not update that much from PaLM
- Thinking about the Fastest Path
- A Zillion Language Model Startups from ex-Googlers
- Ethan’s Scaling Journey
- Making progress on an Academic budget, Scaling Laws Research
- AI Alignment as an Inverse Scaling Problem
- Predicting scaling laws, Useful AI Alignment research
- Ajeya Cotra’s report, Compute Trends
- Optimism, conclusion on alignment
Introduction
Michaël: Ethan, you’re a master’s degree student at Mila in Montreal. You have published papers on out-of-distribution generalization and robustness, accepted as oral and spotlight presentations at ICML and NeurIPS. You’ve recently been thinking about scaling laws, both as an organizer and speaker for the first Neural Scaling Laws Workshop in Montreal. You’re currently thinking about monotonic scaling behaviors for downstream and upstream tasks, like in the GPT-3 paper. And most importantly, people often introduce you as the edgiest person at Mila on Twitter, and that’s the reason why you’re here today. So thanks, Ethan, for coming on the show, it’s a pleasure to have you.
Ethan: Likewise.
Scaling Laws T-Shirts
Michaël: You’re also well-known for publicizing some sweatshirts that say “Scale is all you need, AGI is coming.”
Ethan: Yeah.
Michaël: How did those sweatshirts appear?
Ethan: Yeah, there was a guy named Jordi Armengol-Estapé who interned at Mila, and he got really into scaling laws, apparently via me. And then he sent me the shirt and was like: look how cool this shirt is. He’s the person wearing the shirt in the picture, and he’s like, look how cool this shirt I just made is. So then I tweeted the shirt. And then Irina just turned it into a merchandising scheme to fund future scaling: she just made a bunch and started selling them to people. Apparently she’s already sold more than 10 to Anthropic (ERRATUM: it was actually a gift, and the merch had not been sent at the time of the podcast). Selling scaling law t-shirts, that’s the ultimate funding model for supercomputers.
Scaling Laws, Upstream and Downstream tasks
Michaël: Maybe you can explain intuitively, for listeners who are not very familiar with them, what scaling laws are in general.
Ethan: Whatever your bottleneck is (compute, data, or parameters), you can predict what the performance will be as that bottleneck is relieved. Currently, the thing most people know how to do is predict the upstream performance. The thing people want, though, is to be able to predict the downstream performance. Upstream is the literal loss function that you’re optimizing, and downstream is any measure you have of something you care about, like a downstream dataset, or, I mean, usually it’s just mean accuracy on a downstream dataset.
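To make the idea of predicting performance as a bottleneck is relieved concrete, here is a minimal sketch of fitting an upstream scaling curve and extrapolating it; the functional form and every number below are hypothetical, not from the episode:

```python
# Minimal sketch (hypothetical numbers and functional form, not from the
# episode): fit a saturating power law L(C) = a * (C / C0)**(-b) + c to
# upstream losses measured at small compute budgets, then extrapolate.
import numpy as np
from scipy.optimize import curve_fit

C0 = 1e17  # reference compute scale, keeps the fit numerically well-behaved

def power_law(compute, a, b, c):
    # c is the irreducible loss; a and b describe the power-law decay
    return a * (compute / C0) ** (-b) + c

compute = np.array([1e17, 1e18, 1e19, 1e20])   # training FLOPs (hypothetical)
loss = np.array([3.80, 3.20, 2.75, 2.40])      # measured upstream loss (hypothetical)

(a, b, c), _ = curve_fit(power_law, compute, loss, p0=[2.0, 0.2, 2.0])
print(f"extrapolated upstream loss at 1e23 FLOPs: {power_law(1e23, a, b, c):.2f}")
```

The harder problem Ethan points at next is that the downstream metrics people actually care about do not always follow a clean curve like this one.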
Michaël: And to take a concrete example: for GPT-3, the upstream task is just predicting the next word. What are the downstream tasks?
Ethan: Like 190… a zillion benchmarks that the NLP community has come up with over the years. They just evaluated the accuracy, and things like F1 score, on all of those.
Michaël: And why should we care about upstream and downstream tasks?
Ethan: I mean, we don’t really care about upstream that much. Upstream’s just the first thing that people knew how to predict; the thing we care about predicting the scaling of is downstream. Basically, downstream things that improve monotonically can kind of be interpreted as capabilities or whatever, and then downstream stuff that doesn’t necessarily improve monotonically is often the stuff that gets advertised as alignment stuff. So toxicity, or, if you speculate about the future, stuff like interpretability or controllability, which might not improve monotonically.
Michaël: So you don’t get more interpretability as you scale your models?
Ethan: You do currently, but the classic example is CLIP. It gets more interpretable as it learns representations that make more sense. But you can imagine that at a certain point it gets less interpretable, because at a certain point the concepts it comes up with are beyond human comprehension. Like how dogs can’t comprehend calculus or whatever.
Defining Alignment and AGI
Michaël: Yeah, when you mention alignment, what’s the easiest way for you to define it?
Ethan: I mean, the Anthropic definition’s pretty practical. We want models that are helpful, honest, and harmless, and that seems to cover all the weird edge cases that people come up with on the Alignment Forum or whatever.
Michaël: Gotcha, so it is not like a technical definition. It’s more a theoretical one.
Ethan: Yeah, yeah.
Michaël: So would you consider yourself an alignment researcher or more like a deep learning researcher?
Ethan: I’d say just a beneficial AGI researcher. That seems to cover everything.
Michaël: What’s AGI?
Ethan: The definition on OpenAI’s website is pretty good: highly autonomous systems that outperform humans at most economically valuable tasks.
AI Timelines
Michaël: When do you think we’ll get AGI?
Ethan: I’ll just say like, it depends mostly on just like compute stuff, but I’ll just say 2040 is my median.
Michaël: What’s your like 10% and 90% estimate?
Ethan: 10%, probably like 2035.
Recent Progress: AlphaCode, Math Scaling
Michaël: I think there’s been a week where we got DALL-E 2, Chinchilla, PaLM. Did that like update your models in any way?
Ethan: The one I thought was the crazy day was the day that AlphaCode and the math-proving thing happened on the same day. Because, especially the math stuff, Dan Hendrycks has all those slides where he’s like, oh, math has the worst scaling laws or whatever, but then OpenAI has the IMO stuff. So at least according to Dan Hendrycks’ slides, that would’ve been something that took longer than it did.
Michaël: When you mention the IMO stuff, I think it was a problem from maybe 20 years ago, and it was something you can do with maybe two lines of math.
Ethan: I agree they weren’t super, super impressive, but it’s more just the fact that math is supposed to have the worst scaling, supposedly, and impressive stuff has already happened with math now.
Michaël: Why is math supposed to have the worst scaling?
Ethan: It’s just an empirical thing. Dan Hendrycks has that math benchmark and he tried to do some extrapolations based on the scaling of performance on that. But that it’s already doing interesting stuff with the amount of compute and data we currently have was kind of surprising to me.
Michaël: I think in the paper they mentioned that the method would not really scale well because of an infinite action space when trying to think of actions.
Ethan: Yeah.
Michaël: So yeah, did you update to, oh yeah, scaling will be easy for math?
Ethan: I didn’t update to easy, just to easier than I had thought.
The Chinchilla Scaling Law
Michaël: Okay, related to scaling, the paper by DeepMind about the Chinchilla model was the most relevant, right?
Ethan: Yeah, I thought it was interesting. I mean, you probably saw me retweet it, that person on the EleutherAI Discord who was like, oh wait, Sam Altman already said this like six months ago, they just didn’t put it in a paper.
Michaël: Yeah, he said that on the Q&A, right?
Ethan: Yeah, yeah.
Michaël: Yeah, he said something like, our models will not be much bigger.
Ethan: Yeah. He said they’ll use way more compute, which is analogous to saying you’ll train a smaller model, but on more data.
Michaël: Can you explain the kind of insights from scaling laws relating compute, model size and data, like what’s called the Kaplan scaling law?
Ethan: The original one said something like: if your compute budget increases a billionfold, your model size increases a millionfold and your dataset size increases a thousandfold. And now it’s something like… I know the model-to-data ratio is one to one now, but I don’t remember how big the new compute-to-model ratio is.
Michaël: That’s also what I remember, and I think the main insight from the Kaplan law is that model size is all that matters compared to dataset size, for a fixed compute budget.
Ethan: Yeah, the narrative with the Kaplan one was that model size, like compute, is the bottleneck for now, until you get to the intersection point of the compute scaling and the data scaling, and at that point data’s gonna become more of a bottleneck.
Michaël: So compute is the bottleneck now. What about having huge models?
Ethan: Yeah. They were saying that because model size grows so fast, to get the bigger models you need more compute rather than more data, because you don’t even have enough compute to train a large model on that data yet, in the current compute regime. That was the narrative of the original Kaplan paper. But it’s different now, because the rate at which your dataset size should be increasing, given that your compute budget is increasing, is a lot faster with the Chinchilla scaling law. For some increase in compute, you’re gonna increase your model by a certain amount, and the amount your dataset size increases is a one-to-one relation to the amount your model size increases. I don’t remember what the relation between model and compute is now, but I know the relation between model size and dataset size is one to one.
Michaël: And the main insight is that now we just need more data and more compute, but not a lot more compute; we need the same amount more compute. So we just have to scrape the internet and get more data.
Ethan: It just means like to use your compute budget optimally, the rate at which your dataset size grows is a lot faster.
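For reference, the rough compute-optimal allocations the two of them are comparing, as reported in Kaplan et al. (2020) and Hoffmann et al. (2022, Chinchilla); the exponents below are approximate and quoted from memory:

```latex
% N = parameters, D = training tokens, C = compute budget (FLOPs)
\begin{align*}
\text{Kaplan et al. (2020):}\quad & N_{\mathrm{opt}} \propto C^{0.73}, \quad D_{\mathrm{opt}} \propto C^{0.27} \\
\text{Hoffmann et al. (2022, Chinchilla):}\quad & N_{\mathrm{opt}} \propto C^{0.5}, \;\quad D_{\mathrm{opt}} \propto C^{0.5}
\;\Rightarrow\; D_{\mathrm{opt}} \propto N_{\mathrm{opt}}
\end{align*}
```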
Michaël: Does that make you more confident that we’ll get like better performance for models quicker?
Ethan: Maybe for the YouTube stuff, because for YouTube we’re not bottlenecked by data, we’re bottlenecked by compute. But that implies model sizes might not grow as fast for YouTube or whatever. For text and code, it means we’re probably gonna be bottlenecked by dataset size earlier than we thought. But for YouTube, that might speed up the unsupervised-video-on-all-of-YouTube timeline stuff.
Limits of Scaling: Data
Michaël: Yeah, so I’m curious what you think about how much we’re bottlenecked by data for text.
Ethan: Yeah, I asked Jared Kaplan about this, and he said, wait, okay, it’s 300 billion tokens for GPT-3. And then he said the Library of Congress or whatever could be 10 trillion tokens or something like that. So the most pessimistic estimate of how much the most capable organization could get is like 500 billion tokens, and a more optimistic estimate is 10 trillion tokens, mostly English tokens.
Michaël: So how many orders of magnitude, in terms of parameters, does this give us?
Ethan: I don’t remember, I haven’t calculated it. I kind of did it with the old one, but I haven’t done it with the new Chinchilla one. But I mean, you said this in your thing today or whatever, we’re probably gonna be bottlenecked by the amount of code.
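As a hedged back-of-envelope sketch (not a calculation from the episode), Chinchilla is often summarized as wanting roughly 20 training tokens per parameter, so a fixed token budget caps the compute-optimal model size at about tokens divided by 20:

```python
# Back-of-envelope sketch (rule of thumb, not from the episode): Chinchilla's
# roughly 20-tokens-per-parameter ratio turns a token budget into a cap on
# the compute-optimal model size.
TOKENS_PER_PARAM = 20  # rough Chinchilla rule of thumb

for label, tokens in [("pessimistic", 5e11), ("optimistic", 1e13)]:
    params = tokens / TOKENS_PER_PARAM
    print(f"{label}: {tokens:.0e} tokens -> ~{params:.0e} compute-optimal parameters")
```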
Michaël: I was essentially quoting Jared Kaplan’s video.
Code Generation
Ethan: Yeah, yeah, but I mean, he’s right. I’m kind of wondering what Anthropic is thinking of Adept, because Adept’s doing the train-on-all-the-code thing, and Anthropic was gonna do the train-on-all-the-code thing, and now they’re like, oh crap, we got another startup doing the train-on-all-the-code stuff.
Michaël: Yeah, so I think you said that if you remove the duplicates on GitHub, you get some amount of tokens, maybe 50 billion tokens, or 500, I’m not sure. Maybe 50 billion, don’t quote me on that.
Ethan: Yeah.
Michaël: And yeah, so the tricks will be data augmentation… you’re applying these things to make your model better, but it’s not clear how you improve performance. So my guess would be you do transfer learning, like you train on all the different languages.
Ethan: That’s definitely what they plan on doing. Like, the Scaling Laws for Transfer paper is literally pre-train on English and then fine-tune on code.
Michaël: My guess is also that if you get a bunch of the best programmers in the world to use Copilot and you get feedback on what they accept, you get higher-quality data. You just get, oh yeah, this works, this doesn’t work. And if you have a million people using your thing 100 times a day, 1,000 times a day, then that’s data for free.
Ethan: I mean, I view that human feedback stuff as kind of the alignment part, the way I see it. Then there are some people who say there might be ways to get better pre-training scaling if you have humans in the loop during the pre-training, but no one’s really figured that out yet.
Michaël: Well, don’t you think like having all this telemetric data from GitHub cooperatives is you can use it, right?
Ethan: Yeah, yeah, but I almost view it as being used for alignment, like for RL from human preferences.
Michaël: Okay, gotcha. Yeah, I think the other thing they did for improving GPT-3 was having a bunch of humans rate the answers from GPT-3, that’s the InstructGPT paper. They had a bunch of humans and it improved, not the robustness, but the alignment of the answers somehow. Like it said fewer unethical things.
Ethan: Yeah. I mean, people downvoted the unethical stuff, I think.
Youtube Scaling, Contrastive Learning
Michaël: Exactly, yeah. And to go back to YouTube, why is scaling on YouTube interesting? Because there’s unlimited data?
Ethan: Yeah, for one, you’re not bottlenecked by data. But I mean, the gist is YouTube’s the most simultaneously diverse and large source of video data, basically.
Michaël: And yeah, for people who are not used to thinking about this, what’s the task for YouTube?
Ethan: Yeah, it could be various things. It might be a contrastive thing, or it might be a predict-all-the-pixels thing. At least places like Facebook seem to think contrastive has better downstream scaling laws, so it’s probably gonna be a contrastive-type thing.
Michaël: What’s a contrastive-type thing?
Ethan: You want representations that have similar semantic meaning to be close together, like have high cosine similarity in latent space. So basically, maximize the mutual information between views. It’s kind of hard to explain without pictures.
Michaël: So you’d say that your model takes a video, like all of the videos and views as input?
Ethan: For frames that were close together in time, it tries to maximize the mutual information between them, via maximizing the cosine similarity between the latents of, like, a ResNet encoder or whatever that encodes the images for both of those frames that were next to each other in time.
Michaël: So it tries to kind of predict correlations between frames in some kind of latent space from a ResNet?
Ethan: Yeah, yeah. In the latent space, you want frames that were close to each other in time to have similar representations, like you maximize the cosine similarity between the hidden layer outputs of the ResNet that took each of those frames in.
Michaël: And at the end of the day, you want something that is capable of predicting what future frames look like?
Ethan: Kind of, well, the philosophy with the contrastive stuff is that we just want a good representation that’s useful for downstream tasks or whatever. So there’s no output, really. You’re just training a latent space, or whatever, that can be fine-tuned to downstream tasks very quickly.
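A minimal sketch of the kind of objective Ethan is describing, assuming an InfoNCE-style loss on temporally adjacent frames; the toy encoder, shapes, and temperature below are hypothetical, not the exact method discussed:

```python
# Minimal sketch (illustrative): an InfoNCE-style contrastive loss that pulls
# together the embeddings of two temporally adjacent frames from the same clip
# and pushes apart embeddings from different clips in the batch.
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, temperature=0.1):
    """z_a, z_b: [batch, dim] embeddings of frames at time t and t+1."""
    z_a = F.normalize(z_a, dim=-1)        # unit vectors, so dot product = cosine similarity
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature  # [batch, batch] pairwise similarities
    targets = torch.arange(z_a.size(0))   # positive pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

# Usage with a stand-in encoder (a ResNet would normally produce the embeddings):
encoder = torch.nn.Linear(3 * 64 * 64, 128)    # hypothetical toy encoder
frames_t = torch.randn(8, 3 * 64 * 64)         # flattened frames at time t
frames_t_plus_1 = torch.randn(8, 3 * 64 * 64)  # flattened frames at time t+1
loss = contrastive_loss(encoder(frames_t), encoder(frames_t_plus_1))
loss.backward()
```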
Michaël: What are the useful downstream tasks, like robotics?
Ethan: Yeah, yeah. There are a zillion papers where people do some pre-trained contrastive thing in, like, an Atari environment, and then they show, oh, now we barely need any RL steps to fine-tune it, and it can learn RL really quickly after we did all this unsupervised contrastive pre-training or whatever.
Michaël: And yeah, wouldn’t your model be kind of shocked by the real world, when you’ve just shown it YouTube videos all the time and then you put it in a robot with a camera?
Ethan: Kind of not, I mean, ‘cause there’s everything on YouTube. They’ve got first-person egocentric stuff, they’ve got third-person stuff. It’ll just realize whether it’s in first or third person pretty quickly. I feel like it just infers the context, like how GPT-3 just infers the context it’s in, ‘cause it’s seen every context ever.
Michaël: Gotcha. So I was mostly thinking about like entropy of language.
Ethan: If it’s literally a video generative model, then you can do the perfect analogies to GPT-3 or whatever. It gets a little trickier with the contrastive stuff, but yeah, I mean, the analogies are pretty similar for either one.
Michaël: So one of the things about the scaling laws papers is that there were different exponents for text.
Ethan: Yeah.
Scaling Exponent for Different Modalities
Michaël: What do you think is the exponent for video? Would it be like much worse?
Ethan: I know the model size relation. The model size relation was the big point of the Scaling Laws for Autoregressive Generative Models paper: the rate at which the model size grows, given that your compute budget grows, is the same for every modality. That’s a big unexplained thing. That was the biggest result of that paper and no one’s been able to explain why that is yet.
Michaël: So there might be some universal law for how model scaling goes across all modalities, and nobody knows why.
Ethan: Yeah. The rate at which your model size grows, given that your compute budget is increasing, is the same for every modality, which is kind of weird, and I haven’t really heard a good explanation why.
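The regularity being referenced from the Scaling Laws for Autoregressive Generative Models paper is roughly the following, with the exponent quoted from memory and best treated as approximate:

```latex
% Compute-optimal model size as a function of compute, with roughly the same
% exponent reported across text, image, video, math, and multimodal models:
N_{\mathrm{opt}}(C) \;\approx\; k_{\text{modality}} \cdot C^{\,\beta},
\qquad \beta \approx 0.7 \ \text{for every modality tested}
```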
Michaël: Who do you think will win the video prediction race?
AGI Race: the Best Funding Model for Supercomputers
Ethan: The person who wins AGI is whoever has the best funding model for supercomputers. Whoever has the best funding model for supercomputers wins. I mean, you have to assume all the entities have the nerve to say, we’re gonna do the biggest training run ever, but given that’s your pre-filter, then it’s just whoever has the best funding model for supercomputers.
Michaël: So who is able to spend the most money? So would it be USA, China, Russia?
Ethan: Yeah, it might be something like that. I mean, my guess is China’s already, like, they already have this joint fusion of industry, government and academia via the Beijing Academy of AI. So my guess is at some point the Beijing Academy of AI will be like, look, we just trained a 10 to the 15 parameter model on all of YouTube and spent 40 billion dollars doing it. And at that point, Jared Kaplan’s gonna be in the White House press conference room, like, look, see these straight lines on log-log plots, we gotta do this in the USA now.
Michaël: Right, right. But how do you even spend that much money?
Ethan: By making people think that if they don’t, they’ll no longer be the superpower of the world or whatever. Like, China will take over the world or whatever. It’s only a fear thing.
Michaël: Looking at the PaLM paper from Google, they seem pretty clever about how they use their compute.
Ethan: You mean the thing where they have like the two supercomputers that they split it across or whatever?
Michaël: Right. TPU pods, I think they call it.
Ethan: Yeah, yeah.
Michaël: So it didn’t seem like they spent more money than OpenAI. They tried to be more careful somehow. So my model of people spending a lot of money is…
Ethan: Like most entities won’t be willing to do the largest training run they can, given their funding.
Michaël: So maybe China, but I see Google as being more careful because they do it on paper, but maybe I’m wrong.
Ethan: Jared Kaplan says Anthropic and OpenAI are kind of unique in that they’re like, okay, we’re gonna throw all our funding into this one big training run. Whereas Google and Amazon, he said they have at least 10x or 100x the compute that OpenAI and Anthropic have, but they never use all that compute for single training runs. They just have all these different teams that use the compute for different things.
Michaël: Yeah, so they have a different hypothesis. OpenAI is like, scale is all that matters, somehow that’s their secret, and-
Ethan: Yeah, it’s something like that.
Michaël: You just scale things and you’re going to get better results. And at Google maybe there’s more bureaucracy and it’s harder to get a massive budget.
Private Research at Google and OpenAI
Ethan: Yeah, it’s weird though, ‘cause Jeff Dean’s latest blog post summarizing all of Google’s research progress mentions scaling and scaling laws a zillion times. So that almost implies they’re on the scale-is-all-you-need bandwagon too. So I don’t know.
Michaël: They probably know, but then the question is how like private things are and maybe there’s stuff we don’t really know.
Ethan: I know a bunch of Googlers have said, yeah, we have language models that are way bigger than GPT-3, we just don’t put ‘em in papers.
Michaël: So have you talked to them privately, or did they say that online?
Ethan: I’ve just heard things from people, and that’s feasible. I’m just not disclosing where I got that information from, but that’s what I’ve heard from people.
Michaël: So, as we’re on gossip, I think something that was going around on the internet right when GPT-3 was launched was that Google reproduced it a few months afterwards, but they didn’t really talk about it publicly. I’m not sure what to do with this information.
Ethan: I know the DeepMind language model papers, like Gopher and Chinchilla, were a year old when they finally put ‘em out on arXiv. They had finished training the language models a year before the papers came out.
Michaël: So we should just assume all those big companies are only throwing papers out when they’re not relevant anymore, when they already have the next thing?
Ethan: Maybe, but yeah, I don’t know why it was delayed that much. I don’t know what the story is, why it was delayed that long.
Michaël: People want to like keep their advantage, right?
Ethan: I guess, but I feel like with GPT-3, they threw the paper on arXiv pretty soon after they finished training it.
Michaël: How do you know?
Ethan: Yeah, I don’t, but I assume it didn’t take that long. Maybe there was a big delay, I don’t know.
Michaël: So I think you could retrace all of Sam Altman’s tweets and then read the next paper six months later and go, oh yeah, he tweeted about that. Sometimes the tweets are like, oh, AI is going to be wild, or, oh, neural networks are really capable of understanding. I think he tweeted that like six months ago, like when they discovered GPT-4.
Ethan: At OpenAI, like when Ilya tweeted the consciousness tweet, everyone was like, goddamn, GPT-4 must be crazy.
Michaël: Yeah, neural networks are in some ways slightly conscious.
Ethan: Yeah, yeah, that was the funniest quote.
Michaël: Yeah, I think people at OpenAI know things we don’t know yet. They’re all super hyped. And I think you mentioned as well, at least privately, that Microsoft has some deal with OpenAI, so they need to make some amount of money before 2024, like…
Ethan: Oh yeah, yeah. I mean, right when the Microsoft deal happened, Greg Brockman said, “Our plan is to train like a 100 trillion parameter model by 2024.”
Michaël: Okay, so that’s in two years?
Ethan: I mean, that was in 2019, but maybe they’ve changed their mind after the Chinchilla scaling law stuff, I don’t know.
Why Ethan did not update that much from PaLM
Michaël: Right. And so you were not impressed by PaLM being able to do logical reasoning and explain jokes?
Ethan: In my mind, the video scaling is a lot worse than text, basically. That’s the main reason why I think AGI will probably take longer than five years or whatever.
Michaël: Okay, so if we just have text, it’s not enough to have AGI. So if we have a perfect Oracle that can talk like us, but it’s not able to do robotic things, then we don’t have AGI.
Ethan: Yeah.
Michaël: Well, I guess my main interest is mostly coding. So if we get coding, like Codex or Copilot, that gets really good, then everything accelerates and engineers become very productive, and then…
Ethan: I guess if you could say engineers get really productive at making improvements in hardware, then maybe, I get how that would be, okay, then it’s really fast. In my mind, at least currently, I don’t see the hardware getting fast enough to be far enough along the YouTube scaling law in less than five years from now.
Michaël: Thinking about hardware, we’re just like humans, Googling things and using.
Ethan: Yeah, yeah. I get what you’re saying. Like you get the Codex thing and then we use Codex or whatever to design hardware faster.
Michaël: You mentioned having like DALL-E, but for designing chips.
Ethan: I mean, Nvidia already uses AI for designing their chips.
Michaël: That doesn’t make you think of timelines of 10 years or closer?
Ethan: Maybe 10 years, but not five years. The thing I’m trying to figure out is how to get a student researcher gig at some place so that I can get access to the big compute during the PhD.
Michaël: Oh, so that’s your plan. Just get a lot of compute.
Ethan: Yeah, I mean, as long as I have big compute, it doesn’t matter where I do my PhD. I mean, it kind of matters if you’re trying to start an AGI startup or whatever, but a safe, safe, safe AGI startup.
Michaël: We’re kind of on record, but I’m not sure if I’m going to cut this part. So you can say unsafe, it’s fine.
Ethan: Yeah, no, no, no. I mean, I don’t even phrase it that way. I just phrase it as beneficial AGI.
Michaël: You were spotted saying you wanted unsafe AGI the fastest possible.
Thinking about the Fastest Path
Ethan: No, no, no. The way I phrase it is, I think I explained this last time, you have to be thinking in terms of the fastest path, because there are extremely huge economic and military incentives that are selecting for the fastest path, whether you want it to be that way or not. So you gotta be thinking in terms of: what is the fastest path, and then how do you minimize the alignment tax on that fastest path? ‘Cause the fastest path is the way it’s probably gonna happen no matter what, so it’s about minimizing the alignment tax on that fastest path.
Michaël: Or you can just throw nukes everywhere and try to make things slower?
Ethan: Yeah, I guess, but I mean, the people who are on the fastest path will be more powerful, such that, I don’t know, they’ll deter all the nukes.
Michaël: So you want to, okay, so you want to just join the winners. Like if you join the scaling team at Google.
Ethan: The thing I’ve been trying to brainstorm about is who’s gonna have the best funding model for supercomputers, ‘cause that’s the place to go, and you gotta try to minimize the alignment tax at that place.
Michaël: Makes sense. So everyone should infiltrate Google.
Ethan: Yeah, so whatever place ends up with the best funding model for supercomputers, try to get as many weird alignment people to infiltrate that place as possible.
Michaël: So I’m kind of happy having a bunch of EA people at OpenAI now, because they’re kind of minimizing the tax there, but…
Ethan: Yeah, I kind of viewed it as all the EA people left, ‘cause Anthropic was like the most extremist EA people at OpenAI. So I almost viewed it as EA leaving OpenAI when Anthropic happened.
Michaël: Some other people came, right?
Ethan: Like who?
Michaël: I don’t know. Richard Ngo.
Ethan: Oh, okay, okay. Yeah, yeah.
Michaël: It’s like a team on like predicting the future.
Ethan: Yeah, yeah. I wanna know what the Futures Team does, ‘cause that’s the most out-there team. I’m really curious what they actually do.
Michaël: Maybe they use their GPT-5 model and predict things.
Ethan: Right, ‘cause I mean like DALL-E, like you know about the Foresight Team at OpenAI, right?
Michaël: They were trying to predict things as well, like forecasting.
Ethan: Yeah, that’s where all this scaling law stuff came from, the Foresight Team at OpenAI. They’re gone now because they became Anthropic. But a team called the Futures Team almost has a similar vibe to a team called the Foresight Team. So I’m kind of curious.
Michaël: But they’re just doing more governance things, optimal governance and maybe economics.
Ethan: That’s what it’s about, governance and economics.
Michaël: Like, Richard Ngo is doing governance there.
Ethan: Okay.
Michaël: Predicting how the future works, I think is in his Twitter bio.
Ethan: Yeah, yeah, but I mean, that’s somewhat tangential to governance. That almost sounds like something Ray Kurzweil would say, “I’m predicting how the future works.”
Michaël: My model is that Sam Altman, like, they have GPT-4. They published GPT-3 in 2020, so it’s been two years.
Ethan: Yeah.
Michaël: And they’ve been talking in their Q&A about, like, treacherous turns or something, a year ago. So now they must have access to something very crazy, and they’re trying to think, how do we operate with DALL-E 2 and the GPT-4 they have in private, how do we do something without harming the world? I don’t know. Maybe they’re just trying to predict how to make the most money with their API.
Ethan: You’re saying if they release it, it’s like an infohazard? ‘Cause in my mind, GPT-4 still isn’t capable enough to F up the world, but you could argue it’s capable enough to be an infohazard or something.
Michaël: Imagine you have access to something with the same kind of gap as between GPT-2 and GPT-3, but for GPT-4, in understanding and being general. And you don’t want everyone else to copy your work, so you’re just going to keep it to yourself for some time.
A Zillion Language Model Startups from ex-Googlers
Ethan: Yeah, but I feel like that strategy is already kind of screwed. You know how a zillion Googlers have left Google to start large language model startups? There are literally three large language model startups by ex-Googlers now. OpenAI is like a small actor in this now, because there are multiple large language model startups founded by ex-Googlers, all founded in the last six months. There’s a zillion VCs throwing money at large language model startups right now. The funniest thing, Leo Gao, he’s like, we need more large language model startups, because the more startups we have, the more it splits up all the funding, so no organization can have all the funding to get the really big supercomputer. So we just need a thousand more of these startups so no one can hoard all the funding to train the really big language model.
Michaël: That’s, yeah, like the EleutherAI model, you just do open source. So there are more startups, and all the funding gets split up, I guess.
Ethan: Yeah, you could view OpenAI as the extra-big-brain move: we need to release the idea of large language models onto the world such that no organization can have enough compute, such that all the compute gets more split up, ‘cause a zillion large language model startups will show up all at once.
Michaël: That’s, yeah, that’s the best idea ever. So do you have other gossip besides Google’s? Did you post something on Twitter about people leaving Google?
Ethan: Yeah, I posted a bunch of stuff. Well, I mean, you also saw the… it’s three startups: adept.ai, character.ai, and inflection.ai. They’re all large language model startups founded by ex-Googlers that got a zillion dollars in VC funding to scale large language models.
Michaël: What’s a zillion dollars like?
Ethan: Like greater than 60 million. Each of them got greater than 60 million.
Michaël: So do they know something we don’t know? And they just get money to replicate what Google does?
Ethan: Well, I mean, most of ‘em were famous people, like the founder of DeepMind’s scaling team. Another one is an inventor of the Transformer. Another one was founded by a different person on the Transformer paper. So in some ways, they have more clout than OpenAI had or whatever.
Michaël: But they don’t have the engineering and all the infrastructure.
Ethan: No, they kind of do. A lot of ‘em were like the head of engineering for scaling teams at DeepMind or Google.
Michaël: So there’s another game being played in private at Google, and they’ve been scaling huge models for two years, and they’re just…
Ethan: Yeah, something like that.
Michaël: Starting startups with their knowledge, and they’re just scaling, and peasants like us talk about papers that are released a year after the fact.
Ethan: Yeah, yeah. I guess, I mean, I don’t know how long these delays are. In my mind, yeah, you’re right, you’re right. It’s probably delayed by a year, yeah.
Michaël: So yeah, that makes me less confident about-
Ethan: Oh shit. You look like a clone of Lex Fridman from the side.
Michaël: What?
Ethan: When your face is like sideways, you look like a clone of Lex Fridman.
Michaël: Yeah.
Ethan: Like, ‘cause your haircut’s like identical to his when
Michaël: I’ll take that as a compliment… I started working out. So yeah, Ethan Caballero, what’s the meaning of life?
Ethan: Probably just maximize the flourishing of all sentient beings, like a very generic answer.
Michaël: Right. So I’ve done my Lex Fridman question. Now I’m just basically him.
Ethan: Yeah.
Ethan’s Scaling Journey
Michaël: Maybe we can just go back to stuff we know more about, like your work, because you’ve been doing some work on scaling.
Ethan: Yeah.
Michaël: So, more generally, why are you interested in scaling, and how did you start doing research on that?
Ethan: I mean, I knew about the Baidu paper when it came out. I remember I was at this Ian Goodfellow talk in 2017 and he was hyped about the Baidu paper when it came out.
Michaël: Which paper?
Ethan: “Deep Learning Scaling is Predictable, Empirically.” Yeah, it came out in 2017, and then it was just on the back burner and I kind of stopped paying attention to it after a while. And then Aran Komatsuzaki was like, no, dude, this is the thing, this is gonna take over everything, and this was in 2019 when he was saying that. So then when the scaling laws stuff got re-popularized through the OpenAI papers, I kind of caught onto it a little bit early via talking with Aran.
Michaël: I think 2019 was also when GPT-2 was introduced.
Ethan: But that was before the scaling law stuff kind of got popularized.
Michaël: Right, scaling laws paper is 2020.
Ethan: No, no, the scaling law paper was the very end of 2019, the very beginning of 2020.
Michaël: And you were already on this scaling train since 2017.
Ethan: I was aware of it, but I was kind of just neutral about it until probably the middle of 2019.
Making progress on an Academic budget, Scaling Laws Research
Michaël: And yeah, now you’re interested in scaling because it’s useful for predicting where the whole field of AI is going.
Ethan: And also, I think people underestimate how easy it is to be contrived if you’re not paying attention to scaling trends and trying to extrapolate what the compute and data budgets will look like five years from now.
Michaël: Yeah, that matters if you’re a huge company with a big budget, but maybe if you’re just a random company, you don’t really care about scaling laws that much.
Ethan: Yeah, yeah. Or if you’re in academia currently or whatever: a zillion papers at fancy conferences are like, here’s our inductive bias that helps on our puny academic budget, and we didn’t test any of the scaling to see if it’s still useful when you’re training a trillion parameter model on all of YouTube or whatever.
Michaël: You’re on an academic budget as far as I know. So how do you manage to do experiments in scaling?
Ethan: There’s the scaling law narrative that says, oh, you don’t need the big budget, because you can just predict what the outcomes will be for the large-scale experiments. But at least when that narrative got popularized, it was mostly about upstream scaling, and the thing everyone cares about is downstream scaling.
AI Alignment as an Inverse Scaling Problem
Michaël: Yeah, so if we go back for a minute to your work on alignment, how do you think your work on scaling or generalization fits with the alignment problem?
Ethan: Basically, all alignment, and I guess this triggers the hell out of some people, but all alignment is inverse scaling problems. It’s all downstream inverse scaling problems. In my mind, all of alignment is the stuff that doesn’t improve monotonically as compute, data, and parameters increase.
Michaël: There’s a difference between not improving and inverse scaling. Inverse scaling means it gets worse, right?
Ethan: Yeah, yeah. But I said “doesn’t improve monotonically” because sometimes there are things that improve for a while, but then at a certain point get worse. Interpretability and controllability are the two kind of thought-experiment examples, where you could imagine models get more interpretable and more controllable for a long time, until they get superintelligent, and at that point they’re less interpretable and less controllable.
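As a toy illustration of the distinction being drawn (all numbers hypothetical): a downstream metric that improves with scale for a while and then degrades is the non-monotonic pattern Ethan is flagging, as opposed to one that gets worse the entire time:

```python
# Toy illustration (hypothetical numbers): check whether a downstream metric
# improves monotonically as scale increases; a metric that rises and then
# falls is the non-monotonic pattern being described.
scales = [1e8, 1e9, 1e10, 1e11, 1e12]       # hypothetical parameter counts
accuracy = [0.55, 0.63, 0.71, 0.69, 0.60]   # hypothetical downstream accuracy

monotonic = all(later >= earlier for earlier, later in zip(accuracy, accuracy[1:]))
print("improves monotonically with scale:", monotonic)  # False here
```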
Michaël: Do we have benchmarks for controllability or?
Ethan: Benchmarks that rely on prompting are a form of controllability benchmark.
Michaël: And to kind of summarize your take: if we were able to scale everything well and not have this inverse scaling problem, we would get interpretability and controllability and everything else just by scaling our models well. And so we’d get alignment kind of by default, for free?
Ethan: Yeah, I mean, I guess. There’s stuff besides interpretability and controllability, those are just examples. The reason I phrased it as “things that don’t improve monotonically” rather than just “inverse scaling” is that yes, there are obvious examples where it gets worse the entire time, but there are some you could imagine where it’s good for a long time and then at a certain point starts getting drastically worse. I said all of alignment can be viewed as a downstream scaling problem. The hard part, as Dan Hendrycks and Jacob Steinhardt say, is measurement, finding out what the downstream evaluations are. ‘Cause say you’ve got some fancy deceptive AI that wants to do a treacherous turn or whatever. How do you even find the downstream evaluations to know whether it’s gonna try to deceive you? When I say it’s all a downstream scaling problem, that assumes you have the downstream test, the downstream thing you’re evaluating it on. But if it’s some weird deceptive thing, it’s hard to even find what the downstream thing to evaluate it on is, to know whether it’s trying to deceive or whatever.
Michaël: So there’s no test loss for deception. We don’t know for sure how to measure it, and we don’t have a clear benchmark for it.
Ethan: Yeah, it’s tricky. Some people say, well, that’s why you need better interpretability, you need to find the deception circuits or whatever.
Michaël: Knowing that we don’t yet have all the benchmarks and metrics for misalignment, don’t you think your work on scaling could be bad, because you’re actually speeding up timelines?
Predicting scaling laws, Useful AI Alignment research
Ethan: Yeah, I get the infohazard point of view, but in my mind, whether you wanna do capabilities or alignment stuff that stands the test of time, you need really good downstream scaling prediction. Say you came up with some alignment method that mitigates inverse scaling: you need the actual functional form to know whether that thing will keep mitigating inverse scaling when you get to a trillion parameters or whatever. You get what I mean?
Michaël: I get you, but with a differential progress mindset, Jared Kaplan or someone else will come up with those functional forms without your work.
Ethan: I don’t know, I don’t know. I mean, that’s the thing though: Anthropic’s got that paper, Predictability and Surprise in Large Generative Models, and they’re just like, it’s unpredictable, we can’t predict it. And I’m like, ah, you guys, nah, I don’t believe it.
Michaël: Right, so you’re kind of ahead when you publish papers, because those companies are not publishing their results?
Ethan: I don’t know. Yeah, I don’t even know if Anthropic does the delay-type stuff that OpenAI supposedly does, but maybe they do, I don’t know.
Michaël: And aren’t you creating an infohazard by publishing those laws?
Ethan: I mean, I get the argument, but in my mind, if you wanna do capabilities work that stands the test of time, or alignment work that stands the test of time… In my mind, everything people are doing in alignment will be very contrived without the functional form too. So alignment can’t make progress without it either. You get what I mean?
Michaël: Another kind of view on that is that if people do impressive deep learning or ML work and they’re also interested in alignment, it’s still a good thing. Let’s take EleutherAI: even if they open-source their models, because they did something impressive and they talk openly about alignment on their Discord, they get a lot of very smart people interested in alignment. So if you publish something and become a famous researcher in two years, and you talk about alignment in two years, then it’s fine.
Ethan: I sort of tweet stuff about alignment, I think. Yeah, I mean, I retweet stuff about alignment at least.
Ajeya Cotra’s report, Compute Trends
Michaël: So if we go back to thinking about predicting future timelines and scaling, I’ve read somewhere that you think we might get a billion or a trillion times more compute, like 12 orders of magnitude more.
Ethan: Yeah, I mean, the Ajeya Cotra report said it’s gonna max out at probably 10 to the 12 times as much compute as the amount of compute in 2020, probably around 2070 or something like that. The one issue I have with Ajeya’s model is what she does: it’s flops per dollar times willingness to spend equals the total flops allocated to pre-training runs. The problem is, for the big foundation models, like the 10 to the 15 parameter models of the future or whatever, you’re probably gonna need high memory bandwidth and compute bandwidth between all the compute, which means it has to be on a supercomputer. So it’s not just the flops. What really matters, at least if you’re assuming big 10 to the 15 parameter foundation models or whatever, is the speed of the fastest supercomputer, not just the total flops you can allocate, because if all the flops don’t have good communication between them, then they aren’t really useful for training a 10 to the 15 parameter model or whatever. Once you get to 10 to the 15 parameters, there isn’t much reason to go beyond that. At that point, you just have multiple models with 10 to the 15 parameters doing some crazy open-ended, like, Ken Stanley stuff in a multi-agent simulator. The way I imagine it is: you train the 10 to the 15 parameter model on all of YouTube, and then after that, you have hundreds of 10 to the 15 parameter models that all just duke it out in a Ken Stanley open-ended simulator to get the rest of the capabilities or whatever. Once they’re in the open-ended simulator, you don’t need high compute bandwidth between those individual 10 to the 15 parameter models duking it out in the simulator. Each one only needs high compute bandwidth between its own parameters; it doesn’t need high compute bandwidth between itself and the other agents. So there, you could use all the flops for the multi-agent simulation, but you only need high compute bandwidth within each agent.
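A toy sketch of the two accounting views being contrasted, with every number below hypothetical:

```python
# Toy sketch (all numbers hypothetical): the report-style estimate multiplies
# hardware price-performance by willingness to spend to get total FLOPs,
# while Ethan's caveat is that those FLOPs only help a single giant training
# run if they sit on one tightly interconnected supercomputer.
flops_per_dollar = 1e10        # hypothetical price-performance
willingness_to_spend = 1e9     # hypothetical dollars for one pre-training run
total_flops = flops_per_dollar * willingness_to_spend

fraction_on_one_supercomputer = 0.1  # hypothetical share that is tightly interconnected
usable_for_one_model = total_flops * fraction_on_one_supercomputer

print(f"budgeted FLOPs: {total_flops:.1e}, usable for one training run: {usable_for_one_model:.1e}")
```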
Michaël: So you need a lot of bandwidth to train a model because of the parallelization, but you only need flops to run different simulations at the same time?
Ethan: Yeah, you only need high compute bandwidth within an individual brain, but like if you have multiple brains, then you don’t need high compute bandwidth between the brains.
Michaël: And what was that kind of simulator you were talking about, the Ken Stanley one?
Ethan: Like Ken Stanley, the open-ended guy.
Michaël: I haven’t seen that.
Ethan: It’s like the myth of the objective, open-endedness, like Ken Stanley’s and Jeff Clune’s stuff, all of that. Just Google “Ken Stanley open-ended” at some point. You’ve probably heard of it, but it’s not registering what I’m referencing.
Optimism, conclusion on alignment
Michaël: Okay, so maybe one last open-ended question. On a scale from Paul Christiano to Eliezer Yudkowsky to Sam Altman, how optimistic are you?
Ethan: Definitely not like Eliezer or a doomer-type person. I guess Paul Christiano is probably most similar. I mean, I feel like Paul Christiano is in the middle of the people you just said.
Michaël: Right. Yeah. So you are less optimistic than Sam Altman?
Ethan: Well, yeah, I mean, basically, I think deceptive AI is probably gonna be really hard.
Michaël: So do you have one last monologue or sentence to say about why scaling is a solution to all alignment problems?
Ethan: Just that all alignment can be viewed as an inverse scaling problem. It all revolves around mitigating inverse scaling, but you also have to make sure you have the right downstream things that you’re evaluating the inverse scaling on. Part of what makes it hard is that you might need to do fancy, counterintuitive thought experiments on the Alignment Forum to find what the downstream tests are that you should be evaluating, and whether or not there’s inverse scaling behavior on those.
Michaël: Awesome, so we get the good version as a last sentence, and that’s our conclusion. Thanks, Ethan, for being on the show.