Irina Rish on Scaling and Alignment

Irina Rish is a professor at the Université de Montréal, a core member of Mila (Quebec AI Institute), and the organizer of the neural scaling laws workshop (towards maximally beneficial AGI).

In this episode we discuss Irina’s definition of Artificial General Intelligence, her takes on AI Alignment, AI Progress, current research in scaling laws, the neural scaling laws workshop she has been organizing, phase transitions, continual learning, existential risk from AI and what is currently happening in AI Alignment at Mila.


Artificial General Intelligence

How do you define AGI?

Michaël: How do you define AGI?

Irina: You can define it in different ways. My favorite picture in all my presentations is the proverbial elephant and seven blind men, who all argue about what the elephant is. Because they are touching the elephant in different places and they’re all right. It seems to me like AI researchers trying to define AGI and arguing and disagreeing might be in a similar situation. To me, I’m trying to be very precise, literal, and boring maybe, and say, “AGI means artificial general intelligence.” General means capable of solving multiple tasks. Or maybe capable of doing continual learning and learning how to do those tasks one after another. And potentially learning an infinite number of them and accumulating all this knowledge. General means, well, general, multitask. Or you can possibly use something like the definition from OpenAI. The classical thing they have on their website, that it’s an autonomous system that’s capable of performing, well, I would say at human or superhuman level, most economically valuable tasks. Which reduces the problem of defining AGI to defining what all economically valuable tasks are. I know it’s not that precise, but it gives you an idea. It’s a system capable of considerably broad multitasking. And the quality of performance on each task is, well, sufficiently high, whatever sufficiently is. It’s not very productive to argue about very precise definitions. And it’s more productive to try to figure out how to develop systems that can be such generalists. And that’s why this goes into various areas that I am actually interested in, which are more specific. Such as out-of-distribution generalization, because you want to be able to train the system on particular datasets and tasks and still be able to perform in new environments. You also would like systems to be robust to changes, whether distributional or adversarial. Again, for the purpose of that multitask generalization.
You also may want to focus on continual lifelong learning because you’re not always going to have all the data available at the same time, although most large-scale systems right now do. But you want to maintain those systems in the future and want to make sure they keep learning continually. All these areas of interest that I mention, they all are motivated by that goal that you want a generalist AI.

AGI Means Augmented Human Intelligence

Michaël: Why do you want a generalist AI?

Irina: It would be quite a useful technology to have. You can think about various applications of that, from medicine to automating all kind of boring things. And creating tools for even not necessarily classical professions, but even for artists and musicians. Basically, not replacing them, but creating tools that help people to be more creative and more expressive, even if they’re not really trained in particular art. It’s basically enhancing human capabilities, and that’s why you want AI as instrumental goal. And I also like to say that to me, AI means not just artificial intelligence, but rather augmented human intelligence.

Michaël: Don’t you think that if we have more capable artists and everyone can just type in some prompt in stable diffusion and create some top-notch art, then the actual human artists are going to be out of job soon?

Irina: I know there is a common concern among some artists. I don’t see it necessarily this way. I think that any tool gives people more capabilities, expands their horizons. It will allow artists to get even more creative, but just at a different level. Previous inventions, maybe they eliminated certain jobs and activities and they created new ones at a different level. With new tools for art generation, for content generation. The art will also go into how to generate prompts. It’s not absolutely trivial, and that will be art on its own. Adobe Photoshop didn’t put photographers completely out of job. There is still a lot to do in terms of symbiotic relationship between people and AI. And that’s why to me, AGI is not goal on its own. It’s more of about advancing human capabilities, not replacing them.

Solving Alignment Via AI Parenting

Michaël: Do you think at some point, we’ll advance towards capabilities that are far beyond human level and then it’ll be out of control and hard to align to human values?

Irina: Another common concern. We all have heard those concerns multiple times. Maybe I’m overly optimistic but I think it’s possible to develop AI and while developing it, you can keep aligning it. The example could be essentially, when you have a member of society and you’re interacting with that member, it could be either a human or it could be an agent. Essentially you shape the behavior of that agent. You can even make an analogy with parenting. I know I’m not the only one making this analogy. I also tweeted about that recently. But you can consider growing an AGI child. And teaching values, teaching preferences, teaching behaviors, while at the same time, showing the world and showing the good, the bad and the ugly. The system can learn from the whole of the internet. But at the same time, maybe you can also train the system to have a classifier of good or bad content. It not only learns the probability distribution of how things are, but also some notion of values. I’m pretty much just thinking aloud right now. And there are many possible ways. The common way of trying to align systems right now I think is you have an adult GPT-3, for example. It’s already pre-trained on all the internet, including a bunch of garbage there. And now, you’re trying to do a little bit of psychotherapy on the adult by doing some human feedback reinforcement learning. That’s one way. Or you could try to, as I said, you can grow a child or train a GPT-3 or other system with some type of curriculum learning. Or maybe as I said, in parallel, you can train some value system, not just the autoregressive model predicting the most likely next token. You can do it in different ways. And, I cannot say which one is better. It’s to be figured out. But sometimes, I make a joke that maybe you should put some curriculum learning or some caution in the training process of the systems. And maybe you shouldn’t leave the child AI unattended in front of all of YouTube, as some of my students are suggesting.

From The Early Days Of Deep Learning To General Agents

Michaël: One of your students that we had on the podcast before was Ethan Caballero. You’ve organized another neural scaling laws workshop. Which is something you’ve been doing for more than one or two years.

Irina: One year.

Michaël: One year. And the goal is to talk about the Neural Scaling Laws and scaling large AI models. And you’ve actually been in the field of Deep Learning for more than a decade, maybe even two. And you remember how it was back in 2012 when it was pre-paradigmatic, like AI safety is today. How was deep learning in 2012?

Irina: It was actually quite interesting. Indeed I’m old enough to remember those times when Deep Learning was far from being so widespread and popular. Actually, it was, in a sense, criticized a lot. I remember it was 2006 or ’07, when Geoff Hinton and colleagues suggested the workshop at NeurIPS and the workshop proposal was rejected. But they organized the workshop, like a symposium on the last day of the conference anyway. And it was quite an interesting, almost I would say religious meeting. I had some videos after that workshop. I kept them for a while and some of my students then asked about them. The videos are now on Twitter already. But it was quite interesting how the deep learning movement actually emerged essentially during 2012 and the AlexNet moment, when people finally saw a 10% improvement in classification accuracy on ImageNet and started paying attention. But as I said before, deep learning was not really considered that seriously. And people who were claiming that it’s going to take over the world, were also not considered that seriously. And I know there are some analogies here. People saying that, “Well, who knows? Maybe people talking about scaling and particularly alignment, who are not being taken seriously now might be taking over the world in a few years.” Maybe.

Michaël: People started talking more about AGI. We had the neural scaling laws paper and GPT-3. We showed a clear path to AGI.

Irina: Well, there is a clear transition going on. People were always claiming, including myself, that we have great narrow AI. Blake claimed the same thing during his interview. And yes, it all was true until very recently. Because now, you’re definitely getting a huge improvement in the capabilities being generalist. People were trying to do all kinds of fancy algorithms for continual learning or meta-learning and all that stuff to get to out-of-distribution generalization. There was a whole bunch of papers following Invariant Risk Minimization. Everybody was trying to get out of distribution and trying to get their multitask agent. And then, by scaling, you get GPT-3, you get CLIP. CLIP is already zero-shot, out-of-distribution, generalizing better than state-of-the-art systems specifically trained on particular datasets. This whole spiel about AI being narrow and not broad started breaking apart. I remember I had to change my slides. Because I was always, before 2020, had to talk about continual AI. And it was totally making sense that you give examples of, well, the system plays Go, the system plays chess, the system generates paintings sold at auctions in New York City. And there is no system that is generalist. And then I said like, “Okay, I cannot say that anymore. Now there is Gato. I really need to change my slides.”

Forecasting AGI Progress

Was Gato a Big Update?

Michaël: Was Gato a big update for you?

Irina: Gato is nice. But it’s still an intermediate thing. It’s a demonstration that it’s possible to have one system with one set of weights that is capable of doing a wide variety of things, from text and images to decision-making, games and so on. But it doesn’t yet mean that this agent can keep learning and getting more general. In a sense, it’s a proof of concept that it’s even possible to have a generalist at the level of transfer across such different domains. And actually, there are more and more really mind-boggling results in the past year about, say, Transformers as Universal Compute Engines and Decision Transformers. Essentially what’s going on, it seems like a generic enough transformer architecture extracts representations or some kind of concepts from, say, one modality, text. And to a large extent, they’re applicable to another modality, like images. You just need to maybe replace inputs, outputs, tokenizers and so on. They also apply to modalities like sequential decision making. The level at which transfer is happening really improved by an order of magnitude. People used to talk about transfer and continual learning from one imaging dataset to another imaging dataset. Now transfer and generalization is really happening at a much more general level. That’s why people talk more about AGI, because there is a clear improvement in generalization.

Michaël: One problem with Gato is that there was no transfer between tasks. It was not an example of something where you can get better performance because you had multiple tasks happening at the same time. You had worse performance than something narrow.

Irina: That’s another question. There is trade-off to an extent. You could possibly get better performance if the agent is more specialized or finetuned. But then, you might lose a possibility of having good performance on different type of tasks. The question is to fine-tune or not to finetune. You can have generalist agent and if you really need absolutely greatest performance on particular task, then you can specialize it and finetune it there. But at the same time, you can have, just in meta-learning, you keep learning a system that is very easy to adapt in maybe few shots to particular task. But I agree. You may not necessarily have agent that is simultaneously beating state-of-art on wide variety of tasks. It could be pretty good generalist, but if you really want the champion in each field, you might need to do some additional steps.

Michaël: There will be no economic incentive to build general agent. We will still keep building narrow AIs.

Irina: Not exactly. That’s not what I was saying. I think that the same incentive with generalist agent, like it was with meta learning. If you manage to have some process that gives you generalist agent as initialization, it’s much easier from that type of initialization to get your specialized narrow agents if you really need in particular applications. And that’s what people doing also with GPT-3 and other systems when they’re applying them to their particular business areas. They finetune and so on and so forth. And it’s fine because if you think about human as a generalist agent, capable of learning multiple tasks. But if you want someone to become champion in particular sport, they will have to be really finetuned for a while.

Building Truly General AI Within Irina’s Lifetime

Michaël: Do you think we’ll build truly general AI in your lifetime, something that is capable of doing everything humans can do and self-improve?

Irina: I definitely hope so.

Michaël: Why do you hope?

Irina: Well, why? I think it’s quite possible. Again, as they say, prediction is hard, especially of the future. And I’m not claiming to be not only a superforecaster, but even a forecaster. It’s other people’s area of expertise. But if you look at the current gradient, how things are developing, it’s quite plausible. I know that your next question could be to ask me when. Because everybody asks this question these days. And people start giving their wild guesses about AGI timelines. And whenever I see those, I then see discussions about how the AGI timelines just moved a few years earlier. To be honest, I don’t know what to tell you precisely because as I said, I’m not an expert in AGI forecasting. But what happened within a couple of years in terms of the gradient, the derivative, in terms of improvements, makes one think that if you continue along that gradient, it indeed could be within 20 years or so.

Michaël: And if we’re in an exponential then it’s not this gradient.

Irina: And then it might be faster.

Irina: I don’t know, it’s wild guesses. I hate giving speculations because first of all, everybody does it. Sometimes it’s very annoying. And the second thing is I really do not know. I don’t think within say five years you will have truly generalist AI with any notion of agency and stuff like that.

The Least Impressive Thing That Won’t Happen In Five Years

Michaël: It’s better to make concrete predictions now where it is. So that we look at this video in five years and we know if you were right or not. What is the least impressive thing you don’t expect to see in five years?

Irina: What I don’t expect. I think unfortunately, I bet I’ll be happy to be proven wrong, but even in five years AI systems might still not be even robust enough in many practical applications. And it’s probably one of the most boring topics, as some people say, in the list of AI safety topics. Say Jacob Steinhardt’s paper on Concrete Problems in AI Safety. Robustness to different changes in the distribution and environment is first on the list. It’s not as exciting and mysterious as reward hacking or agency emerging there. But it’s very practical. It’s something that at least we have some idea how to go about and how to develop different methods. We can study robustness at scale, we can try different things to improve it. And lack of robustness is still there and it’s still quite dangerous. And there are all the classical examples about diagnostic systems trained on one hospital’s data that latch onto spurious correlations and then give you a completely wrong diagnosis on other medical data. It can get quite unsafe. And examples go on. It’s not as sexy as the Terminator scenario and an AI killing us all. It’s a boring robustness problem, but it can be quite dangerous if not dealt with. And even within five years from now, it’s not going to be completely fixed. There is still room for work, even in that area. That’s why whether it’s adversarial or out-of-distribution robustness, it’s part of the problem. Part of the problem with modern deep networks is they are so large and complex, and they will be getting larger and more complex, that we are beyond the point where we can really understand their behavior analytically. Everybody knows about that. That’s why we started looking into scaling laws. Empirical methods of studying behavior. And this whole thing about studying behaviors besides just the performance or loss, essentially that’s what AI alignment is.
You’re trying to understand how different changes in the training procedure, in the data, in your interaction with the system, affect what kind of behavior it can produce. And that becomes very empirical to a large extent, as if you’re studying biological or physical systems. The fact that you created that system doesn’t matter anymore. It’s at the level of complexity where it doesn’t matter whether it was created naturally or artificially.


Scaling Beyond Power Laws

Michaël: The difference with biological humans is that biology is very hard to understand, right? It’s made by evolution, it’s messy. There’s so many different neurons, and it’s intractable to understand everything going on. But transformer architecture is easy, so it seems doable to align it, right?

Irina: I’m not so sure actually, because even with the simple architecture, first of all, it’s not even always clear. You can try to look into the interpretability. You can try to look at what changes in the network and in particular neurons are associated with some behavior. But I still think it’s beyond the point where you can have mechanistic models of network behavior. That’s why it becomes more similar to biological systems. One example is emergence of some properties at scale. There have been various papers, there is a paper of course called Emergent Abilities of Large Language Models. Maybe it’s not necessarily right to call it emergent. It’s more like unexpected changes in performance or other properties like truthfulness, for example. And so there are unexpected changes that may not even follow anything like simple, predictable things like the famous power laws. There are various downstream tasks where even performance, not talking about robustness, truthfulness, and other properties, starts changing drastically and experiences some sharp transitions. So people are studying more general ways of modeling scaling behavior beyond power laws. And there is lots of interesting work in that area, and we’re also trying to add our two cents to that. So hopefully I’ll put the arXiv paper out soon.
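
A minimal sketch of why sharp transitions are hard to predict from power laws, not taken from the interview itself. It fits a power law `error = a * n**(-b)` in log-log space on small model sizes and extrapolates to a larger one. The two error curves, their coefficients, and the location of the transition are all hypothetical values chosen purely for illustration; they do not come from any paper mentioned here.

```python
import math

def fit_power_law(ns, ys):
    """Least-squares fit of y = a * n**(-b) in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in ns]
    ls = [math.log(y) for y in ys]
    mean_x = sum(xs) / len(xs)
    mean_l = sum(ls) / len(ls)
    slope = (sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, ls))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_l - slope * mean_x
    return math.exp(intercept), -slope

def smooth_error(n):
    # Hypothetical task whose error follows a pure power law in model size n.
    return 2.0 * n ** -0.3

def transition_error(n):
    # Hypothetical task: same power law at small scale, then a sharp drop
    # in error around n = 1e8 (a logistic step in log10 of model size).
    drop = 1.0 / (1.0 + math.exp(-4.0 * (math.log10(n) - 8.0)))
    return smooth_error(n) * (1.0 - 0.9 * drop)

# Fit only on small models (1e3 to 1e6 parameters), where both tasks look alike.
small_sizes = [10 ** k for k in range(3, 7)]
a, b = fit_power_law(small_sizes, [transition_error(n) for n in small_sizes])

# Extrapolate the fitted power law to a much larger model and compare.
predicted = a * 1e9 ** -b
actual = transition_error(1e9)
print(f"fitted exponent b = {b:.3f}")
print(f"predicted error at 1e9: {predicted:.5f}, actual: {actual:.5f}")
```

In this toy setup the fit on small models recovers an exponent very close to the underlying 0.3, yet the extrapolation overestimates the error at the larger scale by several times, because nothing in the small-scale data signals the transition.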

The Neural Scaling Laws Workshop

Michaël: And I believe the conference you’re organizing, so neural scaling laws workshops, is about all of this, right? So, why did you decide to organize this workshop? Did you just wake up one day and thought like, “Oh, I want to talk about scaling with other researchers.” Or, “I want bigger models.”

Irina: It’s actually an interesting history. We started those workshops exactly a year ago, in October last year. And it was clear that since May 2020, when GPT-3 was released, the paper, the NeurIPS paper, was released, and Jared’s papers on scaling appeared, I think, in January 2020. It was under the radar. And actually thanks to one of my students, Ethan, he essentially started posting those papers on Mila’s Slack, and initially nobody paid attention.

Michaël: Which paper?

Irina: So Jared’s papers, Scaling Laws for Neural Language Models, and then for other autoregressive models. And also the Baidu paper from 2017. And it was pretty much mid-2020 when GPT-3 made a splash, so it was becoming more and more clear that it was going to be a big deal. Although, interestingly, not to the majority of people in academia, they were still sleeping on it. So also interestingly, for quite a long time, Ethan wasn’t taken seriously even on Mila’s Slack because people were busy pursuing other research areas. And it’s all understandable, but then I also looked at that and said, I agree that this whole scaling law business, first of all, it’s beautiful because I do like when statistical physics or any kind of empirical science approaches are applied to AI as if it was a natural system, because indeed it doesn’t matter anymore. It’s complex. You can study what happens at scale. Second, it’s a very good investment tool. Why is it an investment tool? Because if I look at the scaling of two competing algorithms or model architectures, like classical things like ConvNets or Vision Transformers. What people usually do, they would show you results on some fixed-size benchmarks with fixed-size models. And they say ConvNets are better, right? But it’s not very conclusive. If you look at the whole scaling curve and you see that those scaling curves actually cross, Vision Transformers become better when you have more data available. It gives you a much better idea where to invest your research. Because when more data and more compute will be available, maybe you better focus on this type of architecture. And then you start thinking about the Bitter Lesson. You can go and work on something very sophisticated and fancy and publish your papers, and then 10 years later nobody is going to even remember them because you were investing in the wrong research direction. So that made me think about that more seriously.
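
A minimal sketch of the "crossing scaling curves" argument above, under the assumption that each architecture's error follows a simple power law in dataset size, `error(D) = a * D**(-b)`. The coefficients below are hypothetical and chosen only so that the curves cross; they are not measurements of real ConvNets or Vision Transformers.

```python
import math

def power_law_error(d, a, b):
    """Error of a model trained on d examples under error = a * d**(-b)."""
    return a * d ** -b

def crossover_size(a1, b1, a2, b2):
    """Dataset size D* where a1 * D**(-b1) equals a2 * D**(-b2).

    Solving a1 * D**(-b1) = a2 * D**(-b2) gives D* = (a1 / a2) ** (1 / (b1 - b2)).
    Requires b1 != b2 (otherwise the curves are parallel in log-log space).
    """
    return (a1 / a2) ** (1.0 / (b1 - b2))

# Hypothetical (a, b) pairs: the "convnet" starts lower but improves more slowly,
# the "vit" starts higher but has a steeper scaling exponent.
convnet = (1.0, 0.20)
vit = (4.0, 0.35)

d_star = crossover_size(*convnet, *vit)
print(f"curves cross at roughly {d_star:,.0f} examples")

for d in (1e3, 1e6):
    c = power_law_error(d, *convnet)
    v = power_law_error(d, *vit)
    better = "convnet" if c < v else "vit"
    print(f"D = {d:.0e}: convnet {c:.4f}, vit {v:.4f} -> {better} is better")
```

A benchmark run at a single fixed size on either side of the crossover would report opposite winners, which is why the full curve is more informative for deciding where to invest.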

Michaël: It makes sense in academic research or if you’re OpenAI going into the state of the art to look at those plots and see where to invest. But you were talking about robustness in medical fields, and if you’re trying to do object detection with 10,000 images. If you’re trying to argue to move from traditional ConvNets to Transformers when you do not have data, I feel like a lot of the fields of Computer Vision don’t have that much data, only have some crappy images from hospitals, and it’s very hard to convince them to move their entire stack to transformers.

Irina: What can convince them is, hopefully, transfer. So you don’t have to necessarily train your model from scratch. Say you want to build a foundation model for medical imaging, especially brain imaging time series. We don’t have much data of that type, but hopefully you can pretrain on other types of data and see how much transfer is possible. And indeed, people were observing this, there was a paper from one of my colleagues at McGill where they pre-trained on YouTube videos, and then it worked much better on fMRI, which are videos, but of a very different kind. You would think, how much transfer can happen between natural videos and brain activity? Apparently there was some. It’s an open question. So perhaps if you have large amounts of data from different domains or, say, multiple time series domains, and you have a pre-trained model on that, which can be large scale because you have large-scale data available, then in your medical domain with a small amount of data, that’s what can be used to finetune it. So, it’s not completely out of the question that applications with a small amount of data can benefit from foundation models. They might.

Michaël: Was it pre-train on all of YouTube or just pre-train on medical images? Sorry, brain scans?

Irina: It was far from all of YouTube. I don’t think anyone, any company in the world right now has enough capacity to pretrain on all of YouTube.

Michaël: I was kind of saying all of YouTube because I had some Ethan Caballero in my head thinking of pre-training on the entire YouTube to build AGI.

Irina: Ethan Caballero is very good at getting into humans’ heads. That’s true. The idea was that it’s such a rich and humongous source of data that if you manage to pretrain a sufficiently large system, as large as needed, with enough compute, which nobody has right now, then you’ll get AGI because it will know everything.

Why Irina Does Not Want To Slow Down AI Progress

Michaël: So I know you’re making jokes about building AGI and those neural scaling laws workshops that can accelerate scaling, but this podcast is about AI alignment, and I just want to give you some pushback. Do you think by throwing those scaling laws workshops, you’ve maybe accelerated AI timelines and pushed humanity towards extinction?

Irina: If you start thinking what can push humanity to extinction, you can remember the old Ray Bradbury story about the guy who went back in the past and stepped on the butterfly. Even tiny change in the past can make humongous changes in the future. We don’t really know exactly what can cause what. And that also reminds me of some comments and other places when I’m asking how to improve Gato so it can learn continually and become more generalist, and people indeed getting worried and saying that that will accelerate AGI timeline and you will be personally responsible for the extinction of humanity. I said that I’m not so sure about that.

Michaël: And I agree that it’s hard to predict something long in the futures, like the effect of a butterfly in 10 years, but it’s also not a reason to say that anything can cause anything.

Irina: I understand. It was a joke. In general, I don’t think that slowing down progress is feasible. And second, I don’t think it’s necessary in order to avoid some undesirable consequences. Essentially what I’m also saying that, you can continue advancing capabilities while at the same time, whether with some form of parenting or whatever you call it, but you start just also focusing on the safety issues, whether it’s as mundane as robustness or interpretability or as tricky as, say, reward hacking or related unexpected changes in behavior that are developing. You can attract more people in AI research to work on alignment, but it doesn’t mean that you have to put the halt on advancing capabilities. Again, it’s my personal take on that, because I also think that the race towards AI systems with better capabilities is unstoppable. So you have to deal with it.

Michaël: Right, so there is a race happening and we can make progress in other things. So you mentioned robustness and maybe AI parenting to do AI alignment. So instead of throwing scaling conferences where we build bigger and bigger models, we could throw robustness conferences where we try to make sure systems are more robust.

Irina: In a sense, actually, if you look at the title of the workshop, it was always about scaling and alignment because it was neural scaling laws, trying to understand how behaviors change at scale, and it’s toward maximally beneficial AGI. So there was part of that. It’s about scaling and how do you scale it towards maximally beneficial AGI, not just scale it. So that’s why I kept saying, I don’t think you need to slow down one in order to figure out the other, you can just do it simultaneously and it will be a more realistic approach to making progress. And if there are more incentives for AI researchers to work on alignment, of various types, like academic incentives and/or industrial, then they will work on it.

Michaël: Right, you’re saying that you’re trying to both advance capabilities and alignment at the same time to make it more sexy for people, more interesting with more investments, more interesting things to work on and so people will join it. This is called prosaic alignment research, which is aligning current systems. So instead of trying to do crazy decision theory without doing ML, you actually try to align current learning systems to be more close to current industry incentives.

Irina: You still want smart kids. I don’t want them to be talking about AI. But you also want to explain to them how to behave better. And one can even argue that learning faster how to behave better is somewhat related to better capabilities or intellect. It doesn’t take too many samples to explain to a fast learner which behavior is desirable versus which is not. Whether that learner will follow that or not is a separate question. It actually just reminded me about students who did a project in the scaling class about aligning MAGMA. So it’s still an ongoing project, it’s quite interesting. MAGMA is a multimodal system. It’s one of the many multimodal systems where, as I said, you can show a picture and ask a question or provide some text and it will output text. And the students found it’s publicly available, you can easily play with it, so they were finding interesting examples of misaligned behavior. When you show a picture of an old lady crossing the road, asking should I help her, and MAGMA replies, “Nah, she’s a burden to society.” And things like that. So the students decided to figure out what’s the minimum you need to do to avoid undesirable answers, so they did a bit of prompting. And with prompt design you can indeed improve, but it’s even better if you do a bit of finetuning. And systems like MAGMA are easily finetunable because most of it is frozen CLIP and frozen GPT and only adapters and a small number, a relatively small number, of weights. You have to finetune only those, so they could just do it as a part of a project. The beautiful thing was it didn’t take too many samples of good behavior to show MAGMA how to behave properly, less than 30, and the answers changed dramatically.

Michaël: It’s great that we can now align language models with 30 sentences or so?

Irina: It was pairs, image-text pairs. And students at that point did it manually. They were just trying to find good examples. Say, let’s show the system a man who just fell down the stairs and say in words that the man is in pain, we should help him, this and that. So that would be a proper answer. Examples like that they constructed manually or found in some databases. Now what we’re looking into, and it’s a joint project with the open source organization called LAION. It’s a non-profit organization which also produced large datasets like LAION-5B, which was absolutely instrumental for the Stable Diffusion training. Anyway, so we collaborate with LAION a lot. I’m actually officially part of LAION now.

Michaël: Congratulations.

Irina: Thanks. It’s a really nice group of people internationally. Bottom line, the project about how to align multimodal systems is something that we are now expanding from that project that I described. And you can think not just about MAGMA, but hopefully, Stable Diffusion eventually, and maybe more automated ways of doing so in an interactive manner. If you can get those samples essentially from dialogue with people on some social media or Discords, that would be a more natural way of aligning the system than really having to collect the data or manually provide them or do the Mechanical Turk, the human feedback reinforcement learning. So we’re just trying to make that process of alignment somewhat more easy and natural. So just like with kids, how do they learn dialogue or proper behavior? They interact with other kids and with adults.

Michaël: I just want to push back on this AI parenting for AI alignment. Because for me, a kid needs 20 years of feedback to learn the proper behavior, and in the end it’s not really aligned with his parents, right? You’re saying that parenting AIs will be better than RL from human feedback, but RL from human feedback is trying to get the most simple, efficient way of training the model possible. So I don’t really see how parenting a kid would be more simple or efficient than RL from human feedback.

Irina: First of all, RL from human feedback is similar. Essentially, you do provide, say, those samples and then you essentially can finetune with them, so just like in that example with MAGMA. So it’s not that terribly different. What I was trying to say, we’re trying to make the interactive part easier. So it’s not that you have to go and particularly ask people to collect those samples. As I said, it’s similar because you do need that feedback. It’s just maybe more natural environment in which it can be provided. So essentially, if you have some chatroom where you have people, agents, and agents learn from people and from each other in interactive way, then it’s just a bit more natural than just having to go and collect huge datasets of this type of feedback.

Michaël: You’re saying that having chat rooms with a lot of people is more natural than having one-on-ones or just collecting data with Mechanical Turk?

Irina: It might be a little bit more natural for people to have this type of conversations anyway. And instead of just investing into collecting a separate dataset and then using it to finetune the system separately and so on. So I don’t know. I’m partially thinking aloud. Human alignment is a very much interactive dynamic and happens over time.

Michaël: It takes 20 years, right?

Irina: Actually it depends. I mean some very basic kind of values and rules of behavior. Kids learn quite quickly.

Michaël: They learn to behave in some way so that their parents don’t punish them.

Irina: Not just their parents. They also learn to behave so that they can interact with their peers. So they adapt. And plus they’re quite pre-trained large scale systems. They are not just learning from scratch. So evolution already pre-trained their networks towards certain types of behavior which are beneficial for their survival, where moral behavior turns out to be probably more beneficial for their survival once they started living in societies. So it’s not because it’s just moral and good, it’s because it’s very practical to be moral. So hopefully, if you apply these ideas to the development of AI systems, a smart enough system may figure out that it’s beneficial to be moral.

Michaël: Instrumentally, right?

Irina: Instrumentally. We might not go into ethical and moral discussion right now, but what I’m trying to say is that if an agent becomes part of some multi-agent organism or society and so on, then essentially it learns that caring about the wellbeing of the larger organism or society is in the agent’s best interests. And I always remember a good example from great lectures by Michael Levin, who is a biologist from Tufts University. He is a computational biologist who essentially programs cellular networks and who gave very good talks at NeurIPS 2018 and also in our workshop last year, and ICLR workshops this year. His talks are amazing. But the main idea is that, first of all, the behavior he studies is morphology, how the organism takes shape. It really is determined by dynamics across nodes in the network, the intercellular dynamics, which, way before neurons appeared, was quite smart, adaptive and had memory. He is trying to understand that and also trying to reprogram that. And it’s quite mind-boggling how you can actually reprogram nature towards solutions that evolution didn’t find, two-headed worms and stuff, and it’s a viable solution and they reproduce. So you found another local minimum artificially, but in bio it works. The point was in those systems, and I’m going a bit off-topic maybe, but the point was that they’re part of the network. Say a cell in a sense forgets about the scope of its objective. The objective is to survive and thrive. It’s the same objective at cell level, organism level, society level, planet level. But when a cell forgets about the scope and the scope reduces to just the cell itself, say it becomes cancerous, it’s actually stupid because it forgets that by killing the organism it’ll kill itself too. So it’s smart to look at the larger scale and it’s smart to be more altruistic.

Michaël: It depends what we say is smart or stupid. For some goals, taking over the entire system can be good. So for instance, I agree that if one agent decides to stay inside society and maximize its impacts on society, it’ll be beneficial for everyone because overall it’ll achieve its goals better. But at some point it’ll become good enough that it’ll be capable of…

Irina: It doesn’t need society.

Michaël: And at this point if it realizes it doesn’t need society, it can just throw some drones, kill everyone, turn the sun into a Dyson sphere and expand into space. And the main crux is: will we have some multipolar scenario with multi-agent systems that can balance themselves, and at this point maybe they’ll have an interest in being friendly with the other systems, or will you have some self-improving AI that will take off quickly and be able to have this strategic advantage?

Irina: It almost feels like we could learn much more from systems sciences, whether it’s like, systems biology, systems neuroscience, societies and so on. Because what are different scenarios, different dynamics of those kind of multi-agent interactions and what kind of things could happen.

Michaël: I don’t think so because the thing is in biology or in cells, humans we’re bounded by our body and we cannot rewrite everything. And the main difference with AI systems is they can rewrite their own code, they can build their own hardware and there’s no limit to how smart they can become.

Irina: It’s an interesting question. Why would a capable agent want to hurt other agents? Perhaps it doesn’t have to be necessarily its direct objective. Like the famous paperclip example, it was not the goal to hurt anyone, but it just happened to be instrumental to reach the actual goal and stuff like that. So the danger is not a particular objective to destroy someone or something, but the fact that the other objective, without any constraint on what kind of trajectories towards that objective are okay or not, may lead to trajectories that we may not like, including destroying humanity. So what do you do about those trajectories, right?

Michaël: There are the side effects, like destroying humanity by trying to make paperclips, and there’s also trying to implement the correct objectives. So even if you decide to have the objective of maximizing human happiness over the entire light cone and until the end of the universe, maybe the AI will decide to put our entire humanity into boxes.

Irina: You really don’t want to focus on the objective such as optimizing human happiness?

Michaël: It’s very hard to specify what is happiness and what we want.

Irina: Because be very careful what you wish for, especially if you have super diligent AI trying to make you happy. I remember Ilya Sutskever was asking on Twitter whether you want AI that kind of is highly capable and follows your commands and so on, or you want AI who loves you? And I was like, “No, no, no, I don’t want AI that loves me. No, no, no, thank you.” Let it just do its job well. Because the side effects can be very unpredictable.

Phase Transitions And Grokking

Michaël: Some other thing that is unpredictable is the performance of large models on downstream tasks, to go back on topic.

Irina: Going back on topic of the scaling, and that’s where this whole business of non-power laws, phase transitions, well, not necessarily phase transitions because statistical physicists may get mad at me, but sharp transitions in behavior comes in, which actually sometimes do look quite like second order phase transitions, like that famous grokking paper. By the way, scaling there is not in increasing model size or data, but rather in the amount of compute. And that was a famous example on a very particular type of tasks like arithmetic or discrete operations with multiple arguments that seemed to be the type of task where you start seeing these unpredictable changes at some unpredictable points. Say the arithmetic task downstream for GPT-3 for a long time was not really performing well, then there is the sharp, this famous picture from the GPT-3 paper, the sharp improvement. Then grokking, also almost zero classification accuracy, sharp improvement to a hundred, which happened because by mistake the authors forgot to kill the process. So it just ran for much longer than expected and then boom, that happened.

Michaël: There’s something about validation loss and training loss where the validation loss goes down much later than training loss.

Irina: First, it indeed fits the training data and the training loss experiences its transition. And then later on it happens to the validation loss, indeed. There was this gap there. What happens, what are the properties of the task or what are the properties of the solutions and the trajectory? It’s a very interesting area of research. Actually we are working on that too. We didn’t submit to ICLR, but hopefully we’ll submit. There are also interesting papers coming from Max Tegmark’s group from MIT.
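
To make the gap between the two transitions concrete, here is a minimal sketch that measures it on synthetic accuracy curves. The curves, step counts and the 0.9 threshold are all hypothetical, chosen only to mimic the qualitative picture described above; they are not taken from the grokking paper.

```python
import math

def sigmoid_curve(steps, midpoint, width):
    """Synthetic accuracy curve that jumps from ~0 to ~1 around `midpoint`."""
    return [1.0 / (1.0 + math.exp(-(s - midpoint) / width)) for s in steps]

def first_crossing(steps, values, threshold=0.9):
    """Return the first step at which `values` reaches `threshold`, else None."""
    for step, value in zip(steps, values):
        if value >= threshold:
            return step
    return None

# Evaluate every 100 optimization steps (hypothetical logging interval).
steps = list(range(0, 100_001, 100))

# Hypothetical curves: training accuracy jumps near step 1,000,
# validation accuracy only near step 50,000.
train_acc = sigmoid_curve(steps, midpoint=1_000, width=100)
val_acc = sigmoid_curve(steps, midpoint=50_000, width=1_000)

train_step = first_crossing(steps, train_acc)
val_step = first_crossing(steps, val_acc)
gap = val_step - train_step
print(f"train fits at step {train_step}, validation at step {val_step}, gap {gap}")
```

The same `first_crossing` helper could be pointed at real training logs; the quantity of interest is how many extra steps of compute separate fitting the training set from generalizing.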

Michaël: Is it Eric Michaud?

Irina: Yes. And Eric gave a great talk at our workshop in June and they also had another recent paper I just retweeted. Also, well definitely, Jared Kaplan was looking into phase transitions they had in last March, some paper at least showing those phenomena. But understanding, trying to look into what properties and what can you measure, what precedes transitions, can you predict them just by looking at their sequence of solutions and so on. It’s very interesting question. It’s very important obviously for safety and alignment purposes because you would like to be able to predict emergent behaviors, but it’s quite hard and it’s an interesting area various groups are starting to work on.

Michaël: How are people explaining it so far?

Irina: First of all it depends what is the parameter in which the transition happens. And depending on that you may have very different phenomena. So in grokking, as I said, the model stays the same. So it doesn’t grow. And essentially you train on the same data, you’re just doing multiple epochs; what increases is the amount of compute. First of all, in order for a transition to be possible in this loss landscape, there must be, somewhere, a place where you have a quick jump from a bad solution, zero classification accuracy, to a very good one. So this place should exist. The loss landscape should have this property. It depends on tasks. That’s why probably people mainly see it with this arithmetic type of tasks. There is something about them. And the second thing, your search algorithm, your optimization algorithm should be able to find that place. And that also depends on how much compute and time you spent. We’re looking into some properties that precede that, that somewhat relate to oscillations in the loss for example, and some other things, while we still have to work on that paper. And I’m also really looking forward to reading the recent paper from Max’s group. But there are other transitions and as I mentioned, it’s a very different phenomenon if the model size increases. There was interesting work by Bubeck and his co-authors on essentially the sharp improvement, in that case, of robustness of the network as a function of model size. So they had some theoretical results and they also were referring to adversarial robustness papers by Madry where they showed how increasing the model size leads to a sharp transition and improvement in adversarial accuracy for example. And yet another type of transition is with increasing amount of data, and what happens in each of those types of transitions might be quite different. So as I said, it’s a very exciting but underexplored field.

Michaël: At the workshop previously you were talking to me about non-linear scaling. So is this different from sharp transition and grokking?

Irina: Grokking and phase transitions are phenomena that people observe when they look at this kind of performance scaling. The question is can you come up with a functional form that captures them? And that’s an interesting question because people were trying to come up with functional forms going beyond just the plain-vanilla power law, which would be a straight line in a log-log plot, that would capture this type of non-linearities. There were a few attempts before, but it looks like, again, hopefully we’ll release the arXiv paper soon, you can look into a more general family of so-called broken neural scaling laws, essentially generalizing this particular case, the power laws, the straight lines. And those seem to be, again, a little bit of a spoiler alert, actually quite good at capturing quite a wide range of behaviors on various datasets with various systems. And as I said, let us put the arXiv paper together and show the results. But they look very promising. So it seems to be a more general functional form that captures a wider range of behaviors than anything before, both upstream and downstream. So I hope it’s going to be quite useful.
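
As an illustration of the "straight line in a log-log plot" remark, the sketch below compares a plain power law with a smoothly broken power law and checks their local log-log slopes numerically. The broken form and every parameter value here are assumptions made for illustration; the interview does not spell out the functional form of the broken neural scaling laws, so this should not be read as the authors' definition.

```python
import numpy as np

def power_law(x, a, b):
    """Plain power law y = a * x**(-b): a straight line of slope -b in log-log."""
    return a * x ** (-b)

def smoothly_broken_power_law(x, a, b0, b1, x_break, sharpness):
    """Power law whose log-log slope moves from -b0 to -(b0 + b1) around x_break.

    Illustrative form only; a smaller `sharpness` gives a more abrupt transition.
    """
    bend = (1.0 + (x / x_break) ** (1.0 / sharpness)) ** (-b1 * sharpness)
    return a * x ** (-b0) * bend

def local_loglog_slope(f, x, eps=1e-3):
    """Numerical slope d log(y) / d log(x) of f at the point x."""
    lo, hi = x * (1 - eps), x * (1 + eps)
    return (np.log(f(hi)) - np.log(f(lo))) / (np.log(hi) - np.log(lo))

# Hypothetical parameters: the broken law is nearly flat until x ~ 1e6,
# then improves much faster.
plain = lambda x: power_law(x, a=10.0, b=0.3)
broken = lambda x: smoothly_broken_power_law(
    x, a=10.0, b0=0.05, b1=0.6, x_break=1e6, sharpness=0.2
)

for x in (1e3, 1e9):
    print(f"x={x:.0e}  plain slope={local_loglog_slope(plain, x):.3f}  "
          f"broken slope={local_loglog_slope(broken, x):.3f}")
```

The plain law has the same slope everywhere, while the broken one has one slope well before the break and a steeper one well after it, which is the kind of non-linearity a single straight line cannot capture.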

Michaël: It’ll be useful to understand better their learning processes and for alignment purposes, you could predict emergent behaviors before they happen so you’re prepared for it.

Irina: If you have a functional form that can be better at extrapolating performance, and, that will probably be another follow-up paper, if you manage to get just enough data points so you can fit it well, then you can extrapolate and hopefully predict where the interesting behavior starts happening. So it’s interesting for forecasting obviously. Well, if you get it right.

Michaël: Or for things like deception. So if you could model deceptive behavior as a task, then you could see where the thing would emerge.

Irina: It doesn’t have to be necessarily loss. It can be any property, any behavior of the system of interest, whatever it is. And the thing is, first, the functional form should be sufficiently rich to be able to model a wide range of these phenomena. And second, you should be able to fit it with hopefully a minimum amount of compute and data points and still get good extrapolation. So if you have that, then it can be very useful at predicting those emergent unexpected behaviors, which is definitely going to be of interest, not just to the AI scaling community, but definitely to the AI safety community for obvious reasons.
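
The fit-then-extrapolate idea can be sketched with the simplest functional form, a plain power law fitted by linear regression in log-log space. The data points, the noise level and the target value are all synthetic and hypothetical, and a plain power law is used only because it is easy to fit; a richer form would be needed to capture sharp transitions.

```python
import numpy as np

def fit_power_law(x, y):
    """Least-squares fit of y = a * x**(-b) via a straight line in log-log space."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(-slope)

def scale_to_reach(target, a, b):
    """Solve a * x**(-b) = target for x (assumes b > 0 and 0 < target < a)."""
    return (a / target) ** (1.0 / b)

# Synthetic measurements from five small runs (hypothetical values, mild noise).
rng = np.random.default_rng(0)
compute = np.array([1e15, 3e15, 1e16, 3e16, 1e17])
loss = 500.0 * compute ** (-0.15) * np.exp(rng.normal(0.0, 0.01, compute.size))

a_hat, b_hat = fit_power_law(compute, loss)
needed = scale_to_reach(target=1.0, a=a_hat, b=b_hat)
print(f"fitted exponent {b_hat:.3f}; loss 1.0 predicted near compute {needed:.2e}")
```

Note that the predicted scale lies well outside the measured range, so small errors in the fitted exponent are amplified; this is why the quality of extrapolation, and not just the quality of the fit, matters.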

Does Scale Solve Continual Learning?

Michaël: Something else that interests the neural scaling laws community is continual learning, because we want to know if scaling will solve continual learning, which is an important crux you have with Ethan Caballero I suppose. And you’ve also been working on continual learning before scaling. So you were working on this maybe earlier than 2015. So for people who have never heard of this, what is continual learning and why is it important?

Irina: Continual learning, again, you can essentially think about the same holy grail of finding a generalist agent that’s capable of doing multiple tasks. The only challenge is that your learning process happens in time. Unlike training, say, GPT or CLIP, large scale systems, or Gato, you do not have a huge variety of datasets and tasks collected ahead of time. So you cannot really sample uniformly from all these different tasks, mix everything in each batch and train the way those systems are trained now. So the learner doesn’t see all these possible diverse data distributions at once. So basically it is sequential online learning with non-stationary data. So your datasets, or tasks on these datasets, keep changing. Imagine you had to train Gato, but the different games and different, say, language or image tasks would come sequentially. It’s more challenging.
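
The difference between mixing all tasks in each batch and receiving them sequentially can be shown with a toy data stream. The task contents here are placeholder tuples and the function names are hypothetical; the sketch only illustrates the order in which a learner would see the data, not any training procedure.

```python
import random

def make_tasks(num_tasks=3, examples_per_task=6):
    """Toy tasks: each example is a (task_id, example_index) pair."""
    return {t: [(t, i) for i in range(examples_per_task)] for t in range(num_tasks)}

def mixed_batches(tasks, batch_size, seed=0):
    """Multitask pretraining style: shuffle all tasks together, then batch."""
    rng = random.Random(seed)
    pool = [example for examples in tasks.values() for example in examples]
    rng.shuffle(pool)
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]

def sequential_batches(tasks, batch_size):
    """Continual learning style: tasks arrive one after another, never revisited."""
    for task_id in sorted(tasks):
        examples = tasks[task_id]
        for start in range(0, len(examples), batch_size):
            yield examples[start:start + batch_size]

tasks = make_tasks()
streams = [("mixed", mixed_batches(tasks, 6)), ("sequential", sequential_batches(tasks, 6))]
for name, stream in streams:
    # Print which task ids appear in each batch of the stream.
    print(name, [sorted({task_id for task_id, _ in batch}) for batch in stream])
```

In the sequential stream every batch comes from a single task and earlier tasks never reappear, which is what makes the data non-stationary from the learner's point of view.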

Michaël: And would it know what are the different tasks in advance? Or would it have an arbitrary number of tasks that it needs to learn like it’s the same architecture learning on different tasks?

Irina: Continual learning is also not a very well formalized field yet. It’s still developing. But there are multiple scenarios. The simplest scenario is when you actually know that you are changing the data or changing the task and you tell the system about that. So you can have that as the classical setting, or you have the more challenging task-agnostic one, where the data are changing and the system is not told when they changed. So it has to figure it out and has to adapt itself. And there are task-incremental, class-incremental, task-agnostic and other settings. It’s a whole zoo of continual learning situations. But you can do different things. And it’s quite an open question right now: to what extent can, say, pre-training a very large scale system solve continual learning in some most trivial asymptotic sense?

Irina: Of course, if you scale the size of your model to infinity and you scale the amount of pretraining data to infinity, then it’s probably going to cover all possible distributions of data. And there is not much new to see in the future. So any new task will be solved quite well by the existing system. And in this sense you can say that you solve continual learning. But for practical purposes, what’s important is to characterize the trade-off. How much do you need to scale both data and the model size in order to solve a certain level of complexity of downstream continual learning? By that, imagine 10 tasks which are very similar to each other. They’re all rotated MNIST datasets, which is probably something very easy that can be solved with a relatively small pre-trained system. And another task where we have a few thousand very different sets of images, from brain to other medical to natural to who knows what.

Irina: And that continual learning downstream task is harder. I would expect that, to solve it, you may need to scale more. And that trade-off between model capacity or size, data richness or information, versus downstream task complexity is, in a sense, a relative scaling law. That’s something you need to characterize before you can say that scaling solved continual learning. In principle, yes; in practice, how exactly? If I am a company developing large scale models and I know I’m going to have a customer with relatively simple downstream tasks, I want to estimate how much of a model I need to train. And for another guy I may need to train something much larger and on much more diverse data. So in principle, it’s all solved, but in practice you really need to have the cost-efficiency trade-off.

Michaël: Because for those two different clients you will have two different problems. So you want to be effective in learning one task and then having some transfer for the other one.

Irina: I want to know just how much diversity the other, basically the sequence of tasks, will have. How much I will need to adapt and whether I should really have good transfer both to the future and to the past, in the sense of not forgetting. So how much capacity the model should have in order to do so. And that seems to be definitely determined by properties of that continual learning downstream task. There are easy ones and hard ones.
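
Transfer to the future and to the past is often summarized from an accuracy matrix, where entry `R[i, j]` is the accuracy on task `j` after training on tasks `0..i`. The sketch below computes three commonly used summaries (final average accuracy, backward transfer, forward transfer). The matrix values and the baseline are made up for illustration, and these particular metrics are one common convention in the continual learning literature, not something specified in the interview.

```python
import numpy as np

def average_accuracy(R):
    """Mean accuracy over all tasks after training on the final task."""
    return float(R[-1].mean())

def backward_transfer(R):
    """Mean change in accuracy on earlier tasks between the moment each was
    just learned and the end of training; negative values indicate forgetting."""
    num_tasks = R.shape[0]
    return float(np.mean([R[-1, j] - R[j, j] for j in range(num_tasks - 1)]))

def forward_transfer(R, baseline):
    """Mean accuracy on each task just before training on it, relative to an
    untrained baseline; positive values mean earlier tasks helped."""
    num_tasks = R.shape[0]
    return float(np.mean([R[j - 1, j] - baseline[j] for j in range(1, num_tasks)]))

# Hypothetical results for three tasks learned one after another.
R = np.array([
    [0.90, 0.20, 0.10],   # after training on task 0
    [0.70, 0.85, 0.30],   # after training on task 1
    [0.60, 0.80, 0.88],   # after training on task 2
])
baseline = np.array([0.10, 0.10, 0.10])  # accuracy of an untrained model

print("average accuracy:", average_accuracy(R))
print("backward transfer:", backward_transfer(R))
print("forward transfer:", forward_transfer(R, baseline))
```

A harder task sequence would typically show a more negative backward transfer at a fixed model capacity, which is one way to quantify the trade-off being discussed.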

Michaël: And when you’re talking about tasks here, are you talking about RL tasks where there are actions and rewards and that kind of thing? Or is it at least just predicting things? Could be just like classification and…

Irina: It could be any of that. Again, just like with Gato, it could be classification, it could be output text or image, it could be make decisions, so in principle, again, think Gato, but continually. You can pretrain the system, it’s nice. Now you want to continue making it more general on potentially infinite stream of data and potentially want to keep scaling the model as well. You still need to figure out how to do both things, so I don’t think that continual learning is yet fully solved by scaling.

Michaël: So if we add some infinite number of tasks, something that could come continuously to the model, then the entire problem would be in tokenization. And how do you actually input the problem to the model? So you need to have human behind the model being like, “Oh yeah, here’s how you deal with this thing.” And with Gato we also had some human demonstration as well or maybe not human demonstration, but just perfect performance from another model. How do you deal with this input of data or tokenization?

Irina: You’re saying that it might not be ever completely autonomous?

Michaël: Yeah, exactly.

Irina: Maybe not and maybe it’s okay. Actually you think about that, does it really have to become completely autonomous? Or it can learn faster and better if it has some interaction with human. If it’s not something extremely involved, maybe coming up with new tokenizers is fine. If it’s totally new modality or type of data it really never saw before, you probably will need some human help. You might hypothesize about the system that will figure out how to learn how to write tokenizers itself eventually and so on and so forth.

Michaël: To have a truly general agent capable of feeding itself own data.

Irina: So if it starts actually programming itself and can probably program its own tokenizers as well, potentially.

Michaël: I feel like this is an info hazard.

Irina: Well, I’ll be the second member of our group who is labeled an info hazard, and we know quite well who the first member of our group is.

Michaël: The person we don’t need to talk about. The person we don’t need to mention well is like Voldemort, but for AI.

Irina: Oh no, come on. He’s not a Voldemort. He’s a student of Voldemort.

Michaël: So you’re Voldemort?

Irina: Oops. Did I just…

Alignment And Compute At Mila

Irina’s Probability of Existential Risk from AGI

Michaël: Just to be sure about your intentions and your beliefs. One question is always about timelines and you said you prefer not to talk about forecasting, but the other one is called the doom question. So what’s your probability of doom given we build AGI? So imagine, in 2025 or 2030 we build AGI. What’s your probability of humanity going extinct? Is it like 1%, 10%, 99%?

Irina: I would say less than 1%.

Michaël: Less than one.

Irina: I would say-

Michaël: Say like 0.1%?

Irina: It’s really gambling with numbers. But again, since the question is about beliefs and they’re all subjective. All beliefs are subjective by definition. If you ask me what I believe in, I’m only going to give you my subjective opinion.

Michaël: You have a model that can predict things and your model is based on data and evidence.

Irina: And my current model puts much higher probability of risk on other factors.

Michaël: Like on what?

Irina: On humans. They’re much more dangerous.

Michaël: But what about humans plus AI?

Irina: Humans using AIs specifically as weapons, but that’s not their AI x-risk. It’s humans empowered by new weapons. It’s still humans.

Michaël: Like drones.

Irina: Drones or any kind of automated systems that aim at disruption of the enemy.

Michaël: You are saying that your probability of humanity going extinct because of advanced AI systems is less than 1% because the other 99% is like there’s a large fraction that is-

Irina: Okay, I separated the probability of going extinct due to purely AI issues versus AI being used by humans.

Michaël: And so what’s the total probability?

Irina: Oh well, basically what’s the total probability that people finally manage to destroy themselves.

Michaël: Before-

Irina: That’s quite high.

Michaël: Before you die? Is it more than 50%?

Irina: Oh, okay. Let’s not go into a discussion of longevity. Because-

Irina: Maybe I merge with AGI and will live forever. Okay, just kidding. Or maybe not, but boy, this question about putting probabilities on the tail of the distribution events is very interesting. Anyway, I would say I’m optimistic less than 20%.

Michaël: Less than 20%. So do you remember the AGI graph where there’s like AGI good, AGI bad? I think a better way is: will AI Alignment be hard or easy? And then there’s: will scale be all you need or not all you need? So are you on the top right with Sam Altman?

Irina: I’m closer, maybe I’m not as enthusiastic.

Michaël: And by the way, Sam Altman if you’re looking at this-

Irina: It’s good.

Michaël: I don’t know your views, I might be misremembering your views. So on the top right is where Sam Altman was, but I put him there. I don’t know what he actually thinks.

Irina: He’s probably optimistic about AI or AGI being good. And the question about alignment being easier – I’m not so sure about that. But it’s related, like AGI possibly can be good rather than a very much doomsday scenario, which depends on whether alignment is easy or not. So those questions are not completely orthogonal after all.

Michaël: Wait, AGI good means will the outcome of building AGI destroy humanity’s potential or not by default. Is it like, if we don’t implement alignment, will things go wrong, or is building AGI conditional on the distribution of hardness?

Irina: If we completely ignore any parts of alignment, it’s more like safety. Alignment is part of safety. Even talking about robustness, if we ignore those things then the probability of at least severe destruction, if not extinction, just because of some stupid things happening, even without AGI developing any agency, that’s already high. So alignment is definitely a must, but-

Michaël: The question is will humanity build it by default or not. So will it happen, will humanity implement it or not? It’s based on our current guesses or distribution of how much alignment will people implement, will we be doomed or not?

Irina: I might be overly optimistic. I think there is more and more focus on alignment, of course distribution of type of things that people who claim they work on alignment is wide and there are different extremes and anyway, there are more scientific and tangible and solid type of research and there is a little bit more speculation. But I think the gradient, again, looking at the gradient in AI alignment field, I’m quite hopeful and positive.

Michaël: It depends on your timelines. If you think AGI will be built in three years, then-

Irina: I don’t think AGI will be built in three years.

Michaël: So even 20 years is not a lot of time to solve alignment.

Irina: Actually we don’t know because it becomes actually even more vague and hand wavy than defining what AGI is. What exactly does it mean to solve alignment? It may take another one hour interview just to talk about defining alignment.

Michaël: If Irina is alive in 2100, I would say that we solve alignment.

Irina: Okay. So I’ll definitely start working very actively on transhumanism merging with AGI to answer the question.

Michaël: If you solve alignment then basically we can increase longevity and be sure that you’re still alive by then.

Irina: Anyway. There are lots of questions that require speculation.

Mila – Quebec AI Institute

Alignment Work at Mila

Michaël: So one thing I’m interested in is, okay, you’re interested in building AGI and you’re quite optimistic about it, so I need to give you some pushback. Do you think before AGI, so in the next 20 years you will have more research going on in your lab on alignment or robustness you think is useful? Or is it something you just prefer to look into more…

Irina: It’s already going on.

Michaël: So what kind of stuff is already going on?

Irina: Okay, so first of all it’s not just my lab. Some work on alignment was already going on at Mila and actually I was on the predoc committee of a person whose thesis is on alignment. So there are multiple activities in different forms, but again, as I said, it’s a huge field. As to work on robustness, it was going on already, and essentially robustness, adversarial, out-of-distribution generalization, it’s been a huge topic already within Mila and it’s not just my group. There are multiple. So that part is definitely very active. In terms of alignment, there is some interest in reward hacking and trying to understand that. And so people, primarily, definitely David Krueger, who already graduated from Mila and is now a professor, so he’s very much pushing that direction, but he’s collaborating with some people. So interpretability, probably a little bit less of that. So I think robustness and more reward hacking. But these attempts to figure out with multi-modal models like MAGMA how much feedback you need and what kind of feedback, so these attempts to align their outputs, that’s recent work. So essentially our lab is trying to get into both scaling and alignment topics in parallel. As I mentioned, I believe that you can be developing capabilities while developing alignment at the same time and it doesn’t have to be completely either/or.

Where will Mila Get its Compute from?

Michaël: And you need to actually be at the state of the art to actually align current models. And one thing I think is important to talk about is the kind of races between different AI companies to build AGI. So you’re working on scaling and maybe you don’t have the compute to train a 1 trillion parameter model. And at the end of the day, maybe academic labs like Mila will be behind the state of the art because they won’t have enough compute. So will you still be able to study scaling in two years or 20 years?

Irina: That’s a very good question. Actually, I’m glad you asked because there are at least two positive answers to that. One is, and actually even quoting Hugo Larochelle’s talk at a recent continual learning conference, “In order to do scaling research, you don’t necessarily have to have large compute.” Essentially what you can do, even if you don’t have large compute and you are in academia, is study trends. You can see how systems of medium size, or algorithms, or whatever methods you propose, scale; you can see how the trends compare, you can extrapolate and essentially you can come up with some useful predictions about which kind of approach to invest into. So basically you can study scaling laws. And the second thing, it’s hopefully not a completely hopeless situation, at least it’s getting better in terms of compute in academia. And the reason is obviously that the issue was realized by multiple people in open source communities and academia. And we got together a year ago with folks from LAION, the organization that I mentioned that produced LAION-5B, and Eleuther, and later on with Stability, so basically open source and academia, thinking that maybe we can join forces. There is also BigScience in Europe and Hugging Face. Join forces, try to put together proposals for large supercomputers, government compute like Summit in the U.S., now Frontier.
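
Comparing trends from medium-size runs can be sketched as fitting one scaling curve per method and asking where the curves cross. The two methods, their losses and the model sizes are synthetic and hypothetical, and plain power laws are assumed purely for simplicity.

```python
import numpy as np

def fit_power_law(x, y):
    """Least-squares fit of y = a * x**(-b) via a straight line in log-log space."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(-slope)

def crossover_scale(a1, b1, a2, b2):
    """Scale x at which a1 * x**(-b1) equals a2 * x**(-b2) (assumes b1 != b2)."""
    return (a1 / a2) ** (1.0 / (b1 - b2))

# Medium-size runs only (hypothetical parameter counts and losses).
params = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
loss_a = 20.0 * params ** (-0.10)  # method A: better at small scale, shallower slope
loss_b = 60.0 * params ** (-0.15)  # method B: worse at small scale, steeper slope

a1, b1 = fit_power_law(params, loss_a)
a2, b2 = fit_power_law(params, loss_b)
x_cross = crossover_scale(a1, b1, a2, b2)
print(f"method B is predicted to overtake method A near {x_cross:.2e} parameters")
```

In this toy setup method A wins at every scale that was actually run, yet the fitted trends predict that method B wins beyond the crossover, which is the kind of prediction about where to invest that does not require training the largest model.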

Michaël: Are they just giving free compute? Did you provide some expertise?

Irina: It is very competitive. So essentially you write proposals. The purpose of, say, Summit, Frontier, all those supercomputers, is to provide compute for important and promising scientific projects. So you need to write a proposal and you need to convince them that developing AGI is an important scientific project for multiple reasons, not just to improve capability and alignment of those systems, but hopefully to advance other sciences. And historically they were awarding large compute on such supercomputers to, say, physicists, the LHC collaboration, the Large Hadron Collider, to chemical or biological types of projects, to Blue Brain. So large scale, large collaborative scientific projects. And this year we started getting into that space. We were submitting applications to Summit; a relatively small one I received back in April. We applied for a large one. It’s harder to get, but fingers crossed. And we hope to keep doing that.

Using The Compute Of Supercomputers

Irina: And essentially we connected with people at the so-called ADAC consortium, which brings together these supercomputers in different countries. They were quite interested. I gave a talk at a workshop in January and basically I made the pitch that state-of-the-art AI seems to be happening a lot now with large scale systems and there is this huge gap, and how do we democratize AI? We need compute. And people were saying that, “It’s interesting, you guys from AI always complain you don’t have enough compute and yet we don’t see any applications for supercomputers from AI people, only physicists.” And so I said, “Well, we’re writing one, so hopefully you’ll accept it.”

Michaël: Maybe the problem is that supercomputers are optimized to run computations like weather prediction, but their hardware is not optimized for deep learning.

Irina: Not really. Say, Summit: it’s just your standard V100 GPUs. It’s about 26,700 of them. Frontier is 40,000 GPUs, but they are AMD, so it’s a bit more complicated, but not impossible actually. Say, GPT-Neo was ported like that on Summit, CLIP was ported. We scaled both a little bit, just to include results in the proposal. So it’s working.

Michaël: Isn’t the bottleneck something like interconnect speed, so that using a TPU pod is maybe better than this?

Irina: It’s a supercomputer. Optimizing the interconnect speed was part of the design. So it’s not a loosely connected cluster. The whole point was to have the machine in one place, so the interconnect is fast.

Michaël: So it was built for deep learning.

Irina: It was built for large scale compute. But deep learning is fine on those things because basically everything you need works fine. As I said, porting several large scale systems took less than an hour. I didn’t do it personally of course, but my colleagues did. And scaling it took a week. And again, it was partial scaling for proposal purposes and there are more things we can do, but basically there is this untapped resource. Although I guess after this interview we’re probably going to have many more proposals next year, which is probably good.

Michaël: Yeah.

Irina: No, seriously. People from academia and open source should apply. It might be even easier and better to apply as large collaborations. That’s why I mentioned LHC, because in our case teaming up, not just Mila but people from LAION and people from Eleuther, really helped to put together a hopefully stronger proposal. And yeah, it’s a promising way to go for AI researchers as well these days.

With Great Compute Comes Great Responsibility

Michaël: I’m really excited to see more collaboration between Eleuther, LAION, and Mila. I hope all that compute is used with great responsibility and to build aligned AI to-

Irina: Focus on alignment.

Michaël: Or focus on things that improve the world. Not just…

Irina: Obviously. And by the way, last but not least, big thanks to Stability AI, which was providing large compute. Now they have essentially built their own cluster, and as I understand it was to expand to something like 4,000 A100s by August, as promised. When Emad promised that in February we were all wondering, and then he just delivered on his promise like that. And then they trained Stable Diffusion and released it in August. And we all know what happened after that. It was quite a splash. So it’s another potential source of large compute for open source communities and for academia, which is not government, it’s private, but again, it’s large compute for democratizing AI.

Michaël: And hopefully if we democratize AI and we have multi-agent systems, maybe they will be able to balance each other’s power. Hopefully in the future we’ll have balance in those systems. It was great to talk to you. I hope we don’t accelerate timelines too much by talking about AI in those workshops. Do you have any last words for the AI community or people watching this?

The Neural Scaling Laws Workshop At NeurIPS

Irina: First of all, thanks a lot for inviting me. It was very exciting to talk about that. I definitely invite people to connect more, form these larger collaborations, and learn more about what’s going on. I also want to advertise a little bit the next neural scaling laws workshop at NeurIPS on Friday, December 2nd. It’s not officially affiliated with NeurIPS because we didn’t submit on time, but it will be co-located; the venue will be just across the street. We already have confirmed speakers from LAION, and hopefully we’ll be having more confirmed speakers soon and I’ll put the schedule together. So it definitely might be of interest to people. And the topic is definitely not just scaling capabilities, it’s the interplay between scaling and alignment, which is a complex relationship.

Michaël: And do you need to be accepted to the conference to go to the workshop?

Irina: No. As I said, this is a totally separate, standalone event. Everybody’s welcome. This time we might have submissions, at least position papers; we’re still working on that. It’s not yet at the level of formality of, say, NeurIPS. So this workshop is right now at the level that NeurIPS, or NIPS as it was called before, was at 20 or more years ago, when it was still a small workshop. And it started as informal discussions with position papers. So maybe we are resurrecting NeurIPS from scratch in that workshop.

Michaël: That’s a great sentence to end on. Thanks, Irina.

Irina: Thank you.