2022-06-24

Raphaël Millière Contra Scaling Maximalism

Raphaël Millière is a Presidential Scholar in Society and Neuroscience at Columbia University. He previously completed a PhD in philosophy at Oxford, is interested in the philosophy of mind, cognitive science, and artificial intelligence, and has recently been discussing at length the current progress in AI with popular Twitter threads on GPT-3, DALL-E 2 and a thesis he called “scaling maximalism”. Raphaël is also co-organizing with Gary Marcus a workshop about compositionality in AI at the end of the month.

Raphaël : https://twitter.com/raphaelmilliere

LessWrong: Raphaël Millière on Generalization and Scaling Maximalism

Effective Altruism Forum: Raphaël Millière on the Limits of Deep Learning and AI x-risk skepticism

00:00:00 introduction
00:00:36 definitions of artificial general intelligence
00:07:25 behavioral correlates of intelligence, Chinese room
00:19:11 natural language understanding, the octopus test, linguistics, semantics
00:33:05 generating philosophy with GPT-3, college essays grades, bullshit
00:42:45 stochastic chameleon, out of distribution generalization
00:51:19 three levels of generalization, the Wozniak test
00:59:38 AI progress spectrum, scaling maximalism
01:15:06 Bitter Lesson
01:23:08 what would convince him that scale is all we need
01:27:04 unsupervised learning, lifelong learning
01:35:33 goalpost moving
01:43:30 what researchers should be doing, nuclear risk, climate change
01:57:24 compositionality, structured representations
02:05:57 conceptual blending, complex syntactic structure, variable binding
02:11:51 Raphaël’s experience with DALL-E
02:19:02 the future of image generation
02:23:43 conclusion

Introduction

Michaël: Raphaël, you’re a Presidential Scholar in Society and Neuroscience at Columbia University. You previously completed a PhD in philosophy at Oxford. You’re interested in the philosophy of mind, cognitive science and artificial intelligence. Over the past couple of years you’ve been discussing at length the recent progress in AI, especially with popular Twitter threads on GPT-3, DALL-E 2, and AGI. You’ve recently been highly critical of a thesis you call scaling maximalism, and that’s one of the reasons you’re here today. Thanks, Raphaël, for coming on the show.

Raphaël: Thanks for having me.

Definitions of Artificial General Intelligence

Michaël: Before we go into too much detail about the recent progress in AI, I think we should start, like all good philosophers, with some definitions that will help ground the discussion. One of the most heavily loaded terms I use on the podcast is AGI, for artificial general intelligence. Could you maybe give us a couple of definitions of AGI that people use in practice?

Raphaël: Right. Yeah, I think the term AGI is one of these terms that annoy a number of people because you can define it in various ways and it has been defined in various ways by various people. So there is a looming threat that you will engage in a verbal dispute where you’re just talking past each other because you don’t have the same definition of the term. And that’s, I think, a common concern in discussions of AI. But also, some of these definitions of AGI might be, if not unhelpful, maybe even borderline incoherent. That’s something that Yann LeCun, François Chollet and other people have suggested. So I think the definition of AGI that people tend to find incoherent is the idea of a maximally general intelligence or universal intelligence, right? Maybe we can come back to this later, but if you think of things like the no free lunch theorem and the idea that even human intelligence is rather specialized in some ways, and this kind of consideration, then you might think that this notion of AGI as maximally general or universal is misguided.

Raphaël: And then, one rather broad definition that I would be on board with would be something like: “a broad capacity for skill acquisition across a broad range of different tasks, with some limited amount of prior knowledge and prior experience required for generalizing across these tasks”. Right? So there, you have the general aspect of intelligence as a spectrum, where your generalization power across new tasks, specifically unknown tasks that are widely out of the range of tasks you’ve encountered before, is one of the key aspects of this generality. And so then you can compare different systems with respect to how general their intelligence is with this broad definition. And you might consider, and I would consider personally, for example, with this definition in mind, that some non-human animals exhibit some form of general biological or natural intelligence. But it’s interesting, because when you talk about AGI with AI people, people these days think immediately of something like superhuman intelligence, rather than something like rodent-level intelligence, which is also fairly general, if you think of what a rat can do.

Raphaël: But building on this definition, because it’s a definition where the generality is relative and pertains to comparing different systems, you can extrapolate from there and define a kind of intelligence that is more general than that of a rat. And in fact than that of a human, possibly, because I don’t think many people would be suggesting that humans have, in principle, the pinnacle of generalization power, right? So you can think potentially of a system or a creature that could generalize even better than humans or across a broader range of tasks. So then you get into this idea of AGI that’s related to superhuman intelligence, but I think we should indeed leave the maximalist definition, the idea of universal intelligence, to the side because it’s not very helpful.

Michaël: So you would get some general animal intelligence and then general human intelligence. And then after that you would get superintelligence.

Raphaël: Superhuman intelligence. Yeah. I mean, it’s tricky though, of course, because many people would argue that intelligence is a multidimensional construct. Depending on how exactly you define it, it could perhaps also be context-sensitive in some ways. And in that respect, you can have a system… I mean, this is a banal observation these days, but you can have a system that has superhuman intelligence in some respect and subhuman intelligence in another. If you think of it in terms of narrow skills, this is very obvious: AlphaGo has superhuman abilities in Go playing, but is completely incapable of doing a bunch of other things we can do.

Michaël: Yeah, that’s why I think the definition of superintelligence from ‘s book of the same name is useful, because it defines it as having orders of magnitude more intelligence than humans in economically viable tasks. So the whole of what humans would consider useful for our economy, machines could do it orders of magnitude, like 10 times, faster or better. And intelligence is defined in a very practical way, where intelligence is just the ability to achieve your goal. So if the goal is designing chips or driving a car, if you’re 10 times safer in driving the car, or you produce 10 times more chips per hour, then you’re more intelligent in that sense. And so your definition of superhuman intelligence with different levels of generality, I would say that it’s interesting, but in practice we would not really care because it would be out of our scope. And I think when we think about the future, we care about human level and we care when it’s smarter than us in a way that we’re not really useful anymore. What do you make of those definitions?

Behavioral Correlates of Intelligence, Chinese Room

Raphaël: So I’m not a huge fan of this characterization of general artificial intelligence or artificial general intelligence as a characterization of intelligence, because I think it’s not really a definition of intelligence. It’s more about behavioral correlates of intelligence, given a specific set of background assumptions. So the range of tasks that are economically valuable in a given society or world or universe is very context-relative. So if you think that there is nuclear war and global collapse of society, perhaps the range of economically valuable tasks that we care about and that humans can accomplish will be vastly restricted, such that a system that today we would consider fairly unintelligent could be better than humans at these tasks. That might just boil down to basic farming tasks, for example. So this is why I’m not a big fan of this definition, because I don’t think it’s really a definition of intelligence. It’s more of a characterization of some behavioral correlates of general intelligence, given a set of background assumptions about what humans find economically valuable.

Michaël: Yeah. So I think you could always find some specific states of our economy or world where we’re at nuclear war and there’s only this and this that is useful. And okay, so the definition adapts to those situations, but at the end of the day, what counts is what fraction of the GDP is produced by humans versus machines. And if, I don’t know, 20% of our economy is automated, then that gives us roughly some sense of where we are in terms of automation, and of when we get to 98%. So for me, it’s more like a continuum, right? And before we reach a hundred percent, we might get to 95%, and the only people doing useful work would be the AI researchers. And I think it’s interesting to come back to what you said about behavioral. Sorry about my English. Yeah. How would you define behavioral intelligence? What’s the difference in philosophy between behaviors and knowledge, or understanding, if any?

Raphaël: Yeah. So you have a lot of definitions of intelligence and specifically AI or artificial intelligence that focus on behavior because it’s easier. So behavior would be any observable output of the system, whether it’s generated text or images for systems like GPT-3 or 2, or whether it’s actually acting out in the world, if it’s some embodied system like a robot or acting in a virtual environment for a reinforcement learning agent or something like that. So you can focus on behavior. And that was in fact the strategy that Alan Turing adopted in his seminal paper on whether machines can think. His claim was that we should not bother too much with the question “can machines think?”, but we should instead focus on a behavioral test, which is the imitation game, the Turing test.

Raphaël: And this is the same definition that was then favored by people like Minsky and McCarthy and others, some of the founding fathers of AI in the ’50s, namely that we should focus on the range of tasks that a system can achieve. And so you can contrast this behavioral definition of intelligence with definitions that appeal to intrinsic capacities of systems.

Raphaël: Sometimes I call these definitions mentalist definitions because they often appeal to mental capacities or cognitive capacities. You could call them cognitive definitions as well. That would be a nice parallel with the evolution of cognitive science or psychology generally from behaviorism to the cognitive turn in the ’60s. So once upon a time, psychology was entirely focused on the observation of behavior and correlations between inputs and outputs, and then moved towards making hypotheses about cognitive processes and what’s going on inside, under the hood, in systems like animals and humans. And so similarly, you can focus on a definition of intelligence that appeals to cognitive skills, cognitive capacities, and I’m personally quite sympathetic to that second approach to intelligence.

Raphaël: I think if you really want to disentangle what is really an instance of intelligence and what is not, you need to make some informed hypotheses about the mechanisms that are at play in these systems, right? You see that with GPT-3 and language models generally. These days people are amazed by their performance, but there are endless discussions about what that performance actually means and how it can be explained by what’s actually happening in the system. And I think we can’t escape these kinds of discussions about what’s going on inside the system.

Michaël: In the context of the Chinese room experiment, do you think the system is actually intelligent or not?

Raphaël: By the system? You mean GPT-3?

Michaël: When you have some human in a room and he has a manual to translate Chinese. And for people who are not familiar, you get as input some English text, and then you have a big dictionary where you can map English words to Chinese words, with some very specific rules. And at the end it outputs translated text.

Raphaël: Right. So, the Chinese room argument from John Searle is this old argument in philosophy that purports to show that at least the kind of algorithms that existed at the time could not have any understanding of language. So the idea is that you have a person, an operator in a room, who gets as input some Chinese symbols and has this big rule book or manual. That’s essentially a giant lookup table that can match input strings to output strings based on various rules. And so this operator in the room is an English speaker who doesn’t speak any Chinese, and yet can take these inputs, Chinese symbols, look up the corresponding output in the manual, and then output Chinese symbols. And through that procedure the operator could converse with some Chinese speakers outside of the room in a way that would be indistinguishable from a fluent Chinese speaker, even though the operator in the room has no understanding of Chinese. John Searle pumps intuitions with that kind of thought experiment to suggest that machines can’t have any understanding.
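To make the “giant lookup table” picture concrete, here is a minimal Python sketch of the operator’s procedure as described above. It is only an illustration under stated assumptions: the rule book entries, the `FALLBACK` reply, and the `operator` function are hypothetical and do not come from Searle’s paper or from the conversation.

```python
# Hypothetical rule book: a plain mapping from input strings to output strings.
# The entries are made up for illustration only.
RULE_BOOK = {
    "你好": "你好！",
    "你会说中文吗？": "会，一点点。",
}

# Hypothetical reply used when the rule book has no matching entry.
FALLBACK = "请再说一遍。"


def operator(symbols: str, rule_book: dict = RULE_BOOK) -> str:
    """Return whatever the rule book lists for the input string.

    The function never parses, translates, or interprets the symbols.
    It only checks whether the exact string appears as a key.
    """
    return rule_book.get(symbols, FALLBACK)
```

Calling `operator("你好")` returns the stored reply without any step that depends on what the symbols mean, which is the intuition the thought experiment trades on. Whether the whole system, rule book included, understands anything is exactly what the argument and its replies dispute.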

Raphaël: It’s an infamous thought experiment in philosophy. I would say most people, especially those working on artificial intelligence these days, range from being indifferent to that experiment to being pretty annoyed by it and by its enduring popularity, because there are a lot of things that you can dispute about how this thought experiment is set up and how it’s pumping intuitions. And there is also an ongoing discussion about the real value of thought experiments in science and philosophy and how much you can infer from these thought experiments. But I think, in the way the thought experiment is set up, with this slow human operator looking at a giant lookup table, my intuition is that indeed neither the operator, nor perhaps the whole system that includes the operator and the room, has any understanding of Chinese.

Raphaël: Now, if you modify the experiment to look a bit more like modern language models, then things get less obvious. But I think one of the key replies to the Chinese room argument, the so-called systems reply, is that you shouldn’t look at whether the operator in the room has an understanding of Chinese, but at whether the whole system, including the operator, the rule book or manual, and the room itself, has an understanding of Chinese. Because in the analogy, the operator is a bit like the CPU executing instructions, and you would have to include the whole system, which includes the state of the algorithm encoded in the manual and, I think, the whole mechanism that gets you from input to output and so on. But anyway, I’m not a huge fan of that particular thought experiment.

Michaël: Yeah. I think my point was mostly that if you consider the whole system, it’s an example of a system that could have a behavior that is useful, like translating English to Chinese, without having some understanding. And I guess what you meant was replacing this rule book with a transformer. Then you get a large language model that effectively translates English to Chinese without actually having an understanding of the world. And it wouldn’t really be able to explain thoroughly what a cup of water is and why humans drink water, but it’s still able to translate all the sentences related to cups of water.

Michaël: I think that’s somewhat similar, even if the technology is quite different from the 1960s to today. And you’re mostly concerned about understanding, and at the end of the day, we might end up with AIs that are similar to transformers and GPT-3, and that could have an impact in the world without having a perfect understanding of everything they’re doing, but still have a massive economic impact. And yeah, perhaps if we include in understanding something like having some human experience of life, maybe they will never have a human experience or a consciousness or anything related, but they will still be able to transform the world as we know it.

Natural Language Understanding, the Octopus Test, Linguistics, Semantics

Raphaël: Yeah. So I think there are a number of things we have to keep distinct here in this discussion. So one thing is the kind of impact that a system has in the real world in terms of, for example, accomplishing economically valuable tasks, or being able to perform a bunch of other tasks that we care about. And it’s true that people in AI, in the industry especially, and it’s perhaps less true for some labs or some groups (DeepMind, for example, cares more about some of the more substantive questions about the capacities of the systems), but a lot of groups in the AI industry care more about building things that can do certain things and don’t care that much about how they do these things as long as they can do them. So a lot of people building language models, for example, don’t really care that much about whether these models have an understanding of language and so on.

Raphaël: Now, I think there are other things we should keep separate. So there’s one question, which is whether algorithms like language models have an understanding of natural language and the meaning of words and sentences. There is a question that’s related but distinct about whether algorithms, whether language models or others, for instance reinforcement learning algorithms, have an understanding of the world. So these two things are related, but not quite equivalent. And I mean, perhaps we can come back later to the modernized version of the Chinese room argument that has been posed by Bender and Koller in the form of the Octopus Test, in a paper called Climbing towards Natural Language Understanding, that was published a few years ago. That’s an argument that pertains to language understanding, and the idea is the following.

Raphaël: (It’s trying to pump intuitions in a way that is more adapted to the kind of technology we have today, as opposed to the good old-fashioned algorithms that John Searle was concerned with.) Imagine that you have two remote islands somewhere in the Pacific Ocean that are not too far away from each other, and you have stranded humans, one on each of these islands, who communicate with each other through a telegraph system with a deep-sea cable. And you have an octopus, a very, very smart octopus, that is somehow tapping into this deep-sea cable and starting to listen to the conversations between the two humans.

Raphaël: Now imagine that again… all these thought experiments require a lot of imagination and that’s part of the problem… but imagine that this octopus somehow can, by eavesdropping on the conversations between the two humans, gradually build some statistical model of language based on the distributional statistics of how words are used by these two humans, to such an extent that eventually it can hack the communication system, insert itself, and talk with either one of the humans, or maybe both of them, but maybe just one of them, pretending to be the other human, right? So it can just intercept communications and replace them, for example. Based on the information it has gathered about the distributional statistics of words, just by listening in on conversations, it might be the case that this smart octopus is able to do a very good job at convincing humans on the islands that they are talking to the other human when in fact they are interacting with the octopus, which is just outputting some strings based on distributional information.

Raphaël: Now what Bender and Koller say is that the octopus, being a deep-sea creature that has no interaction with the world above the surface, wouldn’t truly understand what it’s talking about when it’s talking about things that are happening on these islands. For example, I can’t remember exactly how they set up this part of the thought experiment, but something like: if one of the humans is concerned with how to build a catapult, the octopus might output some sentences about that, but would have no underlying understanding of how a catapult works, what is required to build one with a coconut and so on, and how that works in the real world, right? And the broader point of Bender and Koller’s paper is to say, well, there is a distinction in linguistics, at least in classical Saussurean linguistics, between things like form and meaning.

Raphaël: And so all the octopus is interacting with is the form of linguistic items. So that’s the kind of text string, and it depends on whether this is audio communication or text communication; let’s assume it’s just written text. So it can only gather information about the distributional statistics of these text strings, so the form of linguistic items. But it doesn’t really have any grasp of the meaning of linguistic items, where meaning isn’t this [inaudible] in terms of some kind of relationship to the world and to the referents of words in the world.

Raphaël: Now, I’m ambivalent about this whole argument. On the one hand… That’s probably going to be the theme of this conversation because I generally take the middle ground between hard line skepticism and AI hype and evangelism… So I think it’s right to say that systems like this fictional octopus or current language models don’t understand language in the way humans do, they don’t learn language and acquire language in the way children do. And there are various reasons to think that their understanding of language is at the very least quite limited. That being said, I think there are a few things in that particular paper and thought experiment that are debatable.

Raphaël: One is the very stark distinction between form and meaning, because this distinction might not always be so absolute. There are some domains in which the form of linguistic items has some nontrivial relationship to the meaning of linguistic items. There’s this thing called sound symbolism, for example, and in linguistics they have experiments on this: if you ask people to relate made-up words to shapes, they will tend to associate the made-up word kiki with a spiky shape more than a round shape, and vice versa for other words that sound different. So there are formal properties of linguistic items that can have nontrivial relationships to their referents.

Raphaël: We probably can’t go into details on that, and I’m currently working on a paper that gets into these ideas, but I think there are other ways in which the form-meaning distinction is a little bit more porous and fuzzy than Bender and Koller allow it to be. So that’s one thing, and then the other thing is that they’re bringing in some considerations about the modeling of communicative intent, for example, as one of the shortcomings of the octopus and of language models. So it doesn’t really have any notion of the communicative intent of the human speakers on the islands. And I think that’s perhaps muddying the waters a little bit, because it’s bringing together different issues.

Raphaël: One of them is the classic notion of referential grounding. So grounding your understanding of words in some capacity to grasp the referents out there in the real world, presumably through some direct or indirect causal-historical interaction with the referents in the real world. And another issue is the modeling of communicative intent. So trying to understand what exactly someone means in conversation. There might be some misunderstandings of communicative intent even in our conversation in this podcast, right? Even though we do both have the capacity to ground our understanding of word meanings in some knowledge about their referents in the world. I think the modeling of communicative intent is a distinct issue that comes apart from the issue of referential grounding. So, that’s another aspect in which I think this paper might conflate issues that we should keep separate.

Raphaël: I have this concern and I don’t want to overstate that claim too much, but I would be prepared to say that language models of the kinds we have today can, in principle, have some form of semantic competence. That falls short of what we would call human-level language understanding for various reasons, but it is also not equivalent to something like ELIZA in the ’60s and the kind of symbol manipulations it was limited to. Or indeed equivalent to Stochastic Parrots, if you take this influential metaphor proposed also by Emily Bender and Timnit Gebru and colleagues. So I think the Stochastic Parrots metaphor is selling short the competence of language models. And I have this middle view that tries to do justice to what these models can do without inflating and anthropomorphizing what they can do.

Michaël: I think most people in the ML community would agree that we are very far from ELIZA right now, and that these models have some competence but are still below the semantic understanding of humans. So, I don’t think your take is super controversial.

Raphaël: No, I agree. I think it should be a very reasonable take. But it’s interesting because it’s probably a silent majority view. But discussions and debates on social media, like Twitter, for example, tend to give more visibility to the more extreme, polarized views, perhaps.

Michaël: Sure. And the previous guest on our podcast was Ethan Caballero. He was on an extreme that you called scaling maximalism, which we will talk about later. Just to go back to your analogy with the octopus. I like it because the octopus has eight arms, like a transformer with eight attention heads or something. And it’s really close to modeling the understanding of a large language model. And in some sense, what you said about communicative intent. So to have communicative intent, you need to say something because you want an impact in the world. You want a consequence. I say something to you because I want you to get the information I’m transmitting and maybe do something about it. But there’s also the way in which, when you start a word and then you start a sentence or a paragraph, you try to push the essay in a certain direction.

Generating Philosophy with GPT-3, College Essays Grades, Bullshit

Michaël: And if you’re only predicting the next token without having a clear plan of what you’re going to say in three pages, then you’re doing this Stochastic Parrot thing where you’re saying something coherent without really aiming for a particular direction in your wording. And I think one of the first times I saw your name was two years ago, when you did some relatively impressive text generation with GPT-3, where you gave as input text from famous philosophers, so GPT-3 had to answer those, and you maybe generated certain paragraphs multiple times to have the right ones. But if you want to just talk a little bit about it, I would be curious to hear your take on this.

Raphaël: So that was in the early days of GPT-3. It seems like so long ago already.

Michaël: It is so long ago, two years ago.

Raphaël: Which is a lifetime in AI research. Exactly. It was in the early days. And so I was playing with early access to the model and to the API. And at the time there was this blog post that was making the rounds on the website called Daily Nous, a website used by the philosophy profession. And there was a guest post on that website, which was an edited collection of very short essays about GPT-3 from various philosophers. And I thought at the time that some of these essays were a little bit missing the importance or the significance of GPT-3 as a technological innovation, but also falling into some of the common pitfalls, such as saying, oh, well, it’s just statistical pattern matching or things like that, which is rather uninformative because you could describe a lot of what humans do as statistical pattern matching as well.

Raphaël: But anyway, I thought, just for fun, I would collect these essays together and prompt GPT-3 with these essays and ask GPT-3 to produce an essay in response to these essays. And what I did then, and I later had to clarify this on Twitter because people were jumping down my throat for cherry picking, was generate a few outputs for each two or three paragraphs and then mix and match the results to produce a longer essay. But there was not a lot of cherry picking, honestly.

Raphaël: And the essay we produced was very impressive in my opinion. I mean, again, with some editing on my part, but it was something that required really minimal effort from me, and if it were coming from one of my first-year philosophy students it would not be a great essay, but it wouldn’t be the worst essay either. It would be a run-of-the-mill, slightly mediocre, but not awful essay, and would probably get a fairly decent grade, especially with grade inflation these days. I was telling my colleagues on Twitter, watch out for this stuff. It’s coming to college students and you have to think twice in the future about the kind of assignments you give to students in terms of essay writing.

Michaël: I’m curious, do you see a future for philosophy professors in correcting essays? How do you grade essays in 2025, when people will have access to something that doesn’t generate three paragraphs, but maybe entire pages or multiple pages? How will you keep up with those advances?

Raphaël: It’s a hard question. It depends on what you’re trying to assess. So if you’re just trying to assess stuff like general, encyclopedic knowledge about a subject. So say I teach a class on the philosophy of AI, and I want to see whether my students have memorized what the Chinese Room argument is and what the main objections to it are, and so on. Then giving that out as an essay at home would be worthless for assessing that. But that was already the case before, right? Because people can just look things up on Wikipedia and just copy-paste. We have plagiarism detectors, of course, Turnitin and this kind of software. But you can always rephrase, paraphrase Wikipedia, and that wouldn’t trigger the detectors.

Raphaël: And so in that sense, the advent of language models is not really changing anything. Students can paraphrase stuff they see on the internet. Now, if you want to assess reasoning and argumentative skills, in addition to knowledge, that’s where things get a little bit murkier. I would argue that with current transformer models, there are still questions you can ask students where really showing a deep understanding and capacity to reason about the topic is out of reach for current models. So if you ask the right questions, I think you would still be hard pressed to find a relevant, efficient use of these models. But how long is that going to be true? I don’t know.

Michaël: When will you be out of a job?

Raphaël: Well, I don’t think the job of a philosophy professor is to grade essays, thank God, because that’s probably the worst part of the job. So I would be happy for this to go away. And in fact, I’m not the biggest cheerleader of grades. So I think, if you think of it in terms of reinforcement learning, the kind of rewards or punishments that you get from grades is not always the best learning signal. And there are many studies about this in pedagogy and psychology, and in some ways I think if we shift away from this narrow graded assignment, that wouldn’t be such a bad thing. But if you want to stick with essay writing, you can always assign essays in class in limited time, which is something that we do a lot in France, for example.

Raphaël: We don’t do that much, as you know. We don’t do that much essay writing at home, right? We do this thing called dissertation in class in limited time. In fact, in my own education, because I went through the French system that goes crazy for this kind of assignment, I went through this entrance exam for this school called École Normale Supérieure, and then took this exam called Agrégation, and for this you have seven-hour-long essays in class, right? With someone watching you. So you have to sit for seven hours and write an essay under close supervision without access to a computer or dictionary or anything. So, I mean, if that’s the kind of assignment you want to give. I’m not personally convinced it’s the right kind of assignment, but you can always do that, right?

Michaël: And at the time you will have brain computer interfaces dictating GPT-5 output in your brain directly.

Raphaël: Exactly. I’ll be plugged in through GPT-20 with Neuralink. But, I mean, more seriously though, to come back to these experiments with GPT-3 initially, it prompted me to write this essay at the time, this general audience essay about transformer models. And what struck me is that one way to characterize what they do, which I think still today is perhaps more informative than this Stochastic Parrots metaphor, is that they’re very good bullshitters. In a technical sense of bullshit that was given by the philosopher Harry Frankfurt. So he wrote a very fun essay called On Bullshit. And he gives this definition of bullshit which is basically producing some kind of output, such as text or speech, that’s purely optimized to convince people or compel people without any intrinsic regard for truth or falsity, right? So that’s what a bullshitter is doing, and I think straightforwardly what these autoregressive models are doing is very good bullshit, right?

Raphaël: It’s producing statistically likely strings one after the other. So statistically plausible speech without any intrinsic regard for truth or falsity. And so this is why I think even the Stochastic Parrots metaphor is short selling what these models can do a little bit, because they’re not literally parroting speech, right? So I think there is this sentence from the Stochastic Parrots paper that these models are parrots because they’re haphazardly stitching together sequences from the training data. And I think we have good evidence that this is not actually what these models are doing. There is some memorization that can happen, but by and large, they can produce completely novel outputs that are not just a matter of stitching together n-grams from the training data. What they do is much more sophisticated.

Stochastic Chameleon, Out of Distribution Generalization

Raphaël: And so I think if you want to have a somewhat deflationary take on what they do, perhaps a better metaphor and a more accurate metaphor would be a Stochastic Chameleon, something that can seamlessly blend in different domains and different regimes of speech and different topics, different styles. You see it also with text-to-image models, which seamlessly adopt the style of different painters, or with models that adopt the mannerisms of speech of different authors and talk about any topic without any intrinsic regard for truth or falsity.

Raphaël: And now then the question is, when do we cross over and move beyond this artificial mimicry, this Stochastic Chameleon behavior, crossover into something that looks more like actual reasoning and understanding. And, of course, that’s the million dollar question.

Michaël: I think it’s always a question of out of distribution generalization. So the chameleon you describe is able to fit in high dimensional space between a couple of data points, and you can transition between one data point to another in a smooth way, because the chameleon changes his color. And at that moment you’re doing high dimensional interpolation which is a hard problem. But you’re still inside your training data. And I think the most robust or difficult definition of intelligence or general intelligence would be to be able to generalize to out of distribution. You’re seeing something from an alien life form, and then you get the problem of “no free lunch theorem”, “it’s impossible to generalize to anything at all”. But I think with Gato or those RL tasks that have been going on in the past few weeks, you get some generalization and you’re able to do very well on a bunch of different tasks.

Michaël: I think one way to measure if you’re really, I don’t know, a monkey, I think a monkey is able to learn tasks very well. If you’re not a stochastic chameleon, but an actual monkey or chimpanzee, able to learn new tasks, then you’d be able to learn Zero-shot or Few-shot new task, right? And what GPT-3 showed was that it was able to be fine tuned to completely new domains and get good performance. And on top of that do very well few-shot learning on some arithmetic task or something else. So, I think we’re getting closer to animal intelligence and human intelligence in that regard.

Raphaël: I think that’s right. I mean, indeed you can think of what the chameleon is doing as sampling color space and then interpolating within color space, right? And you can similarly think of what some deep learning models are doing as just projecting some features of the input into latent spaces and then interpolating within latent space. And I think people like François Chollet, for example, have been going on for a long time about how that’s the big limitation of deep learning. So, I think someone like him thinks that this is an intrinsic limitation of deep learning, that it’s always dealing with continuous interpolative data that lies on some kind of manifold and can only do geometric transforms from one manifold to another, and that doesn’t enable true extrapolation and generalization.
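As a rough illustration of the interpolation picture described here, the following Python sketch blends two points in a three-dimensional "color space" and checks whether the result stays between them. The colors, the `lerp` and `within_segment` helpers, and the choice of linear blending are all assumptions made for the example; real latent spaces are far higher-dimensional and models do not literally interpolate this way.

```python
def lerp(a, b, t):
    """Linearly blend two equal-length vectors: t=0 gives a, t=1 gives b."""
    return [x + t * (y - x) for x, y in zip(a, b)]


def within_segment(a, b, p, tol=1e-9):
    """True if every coordinate of p lies between the matching coordinates of a and b."""
    return all(
        min(x, y) - tol <= z <= max(x, y) + tol
        for x, y, z in zip(a, b, p)
    )


# Two hypothetical colors the "chameleon" has already seen (RGB in [0, 1]).
green = [0.0, 0.8, 0.2]
brown = [0.6, 0.4, 0.0]

# Interpolation (0 <= t <= 1): the blend stays between the two seen points.
midpoint = lerp(green, brown, 0.5)

# Extrapolation (t > 1): the result leaves the region spanned by the seen points.
overshoot = lerp(green, brown, 1.5)

print(within_segment(green, brown, midpoint))
print(within_segment(green, brown, overshoot))
```

The second check fails because the extrapolated point overshoots one endpoint, which is the toy analogue of asking a model for something outside the region its training data covers.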

Raphaël: But, I mean, you have some debates about this. There’s this paper from Yann LeCun and others, I can’t exactly remember, arguing that interpolation in very high dimensional space amounts to extrapolation or something along these lines, right? So, it probably depends on how exactly you define extrapolation and interpolation. There are also probably some verbal disputes going on there, but I do agree that we are heading towards greater and greater capacities for generalization. That being said, I think we shouldn’t overstate the kind of generalization that current models can do, right? So if you take the Gato, I’m not sure how it’s supposed to be pronounced, but paper from DeepMind, what’s actually striking is that there’s not that much Transfer Learning going on, from what I’ve been reading, right?

Raphaël: So actually, it’s just trained on a bunch of different things and it does okay at these different tasks. Not amazing, it’s not great. Text generation is not great. Language, vision tasks. It’s pretty good at playing Atari games, but not as good as models that have been finetuned for specific games. But interestingly, training it on a bunch of different things with a bunch of different serialized data doesn’t seem to provide a massive performance gain in the way that you might expect, especially if you’re a scaling maximalist. You might expect this to happen at some point and maybe it’ll happen at a greater scale. So I don’t know how much generalization there is from a system like Gato honestly, by which, I mean, I don’t know how much out of domain generalization you observe. And you do get this phenomenon of Few-shot Learning with GPT-3.

Raphaël: Again, I think there is a valid question to be asked about how much of a generalization that really constitutes as opposed to just guiding the model, through prompt engineering, to sample the right region of the latent space, right? So, the initial GPT-3 paper, the original paper talks about Few-shot Learning as a form of meta learning. And later papers have dropped that kind of terminology. I think indeed, this was quite confusing. First of all, because there are different definitions of meta learning, in Reinforcement Learning people talk about meta learning to mean something else. But also because it’s not really learning how to learn. It’s just abstracting some patterns from the prompt that has a certain structure to essentially sample the right region of its ginormous latent space, but there is no real, certainly no changes in the weights happening when you do few-shot learning. Even the term learning there is perhaps a bit of a misnomer. I do worry about how much and, not worry. I don’t worry about it, but I wonder how much you can call it generalization, especially if you mean something like out of domain generalization.

Michaël: Right. So I think what you’re referring to is that you’re not changing something in the memory of a language model when you’re doing few-shot Learning, but you’re rather activating certain parts of the weights by starting with a prompt that says task number one is this, task number two is this, please do task number three. So you’re accessing something by prompt-engineering. So I agree with that claim. And I agree also that, from what I’ve seen, there was not a lot of Transfer Learning happening in Gato. I think what it showed was mostly that you could have a single architecture that did a bunch of different things, with, I reckon, only something very close to a transformer, and that the same process of optimization would work for language, for vision, for robotic tasks. But yeah, maybe that’s not truly relevant for out of distribution generalization.
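To make the "task one, task two, please do task three" pattern concrete, here is a minimal Python sketch of how a few-shot prompt can be assembled as plain text. The `build_few_shot_prompt` helper and the "Q:/A:" layout are hypothetical choices for illustration, not the format of any particular model or API. The point it illustrates is the one made above: everything the model is "taught" lives in the input string, and no weights are updated.

```python
def build_few_shot_prompt(examples, query):
    """Format solved examples plus one unsolved query as a single text prompt."""
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in examples]
    # The final block leaves the answer open for the model to complete.
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


# Two solved arithmetic examples (hypothetical) and one new problem.
examples = [("2 + 3", "5"), ("7 + 1", "8")]
prompt = build_few_shot_prompt(examples, "4 + 6")

# The prompt is the only thing that changes between "tasks";
# constructing it involves no training step of any kind.
print(prompt)
```

Swapping the examples for translations or analogies changes the task the model appears to perform while the model itself stays fixed, which is why calling this "learning" is contested.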

Three Levels of Generalization, the Wozniak Test

Raphaël: Maybe one distinction that’s helpful there, again from François Chollet’s paper on the measure of intelligence, which I quite like, is this distinction between, I think he distinguishes between three levels of generalization. So you have local generalization, which is a narrow form of generalization that pertains to known unknowns within a specific task. So that can be just, for example, you have a classifier that classifies pictures of dogs and cats, and then you can generalize to unseen examples, at test time, that it hasn’t seen during training. So that’s local generalization, it’s just within domain known unknowns in a specific task. Then there is what he calls broad generalization, that’s generalizing to unknown unknowns within a broad range of tasks. So the examples he gives there would be level five self-driving or there was the Wozniak test, which was proposed by Steve Wozniak, which is building a system that can walk into a room, find the coffee maker and brew a good cup of coffee.

Raphaël: So these are tasks or capacities that require adapting to novel situations, including scenarios that were not foreseen by the programmers, because there are so many edge cases in driving, or indeed in walking into an apartment, finding a coffee maker of some kind and making a cup of coffee. There are so many potential edge cases. And with this very long tail of unlikely but possible situations where you can find yourself, you have to adapt more flexibly to this kind of thing. And so that requires this broader generalization. And then there is a valid question about this level two from Chollet: where do current models fit? Can we say that current language models are capable of some kind of broad generalization because of their few-shot learning capacities? I suspect Chollet would say no, because there is a difference between being able to perform tasks that you haven’t been explicitly trained to do, which is what’s happening with few-shot learning.

Raphaël: So you don’t explicitly train PaLM or GPT-3 on arithmetic problems, say, but then it so happens that it can perform some arithmetic tasks after training. So there’s a difference between that and being able to generalize to truly out of domain tasks, right? And given the training set of GPT-3 and PaLM, that includes a bunch of text talking about arithmetic and involving math problems and so on, you might very reasonably say that arithmetic tasks are not really out of distribution, right? They’re within the training set. So I suspect Chollet would say we’re not yet at broad generalization. And then you have what he calls level three, which is extreme generalization, which is the capacity to generalize to unknown unknowns across an unknown range of tasks. That is very, very broad.

Raphaël: So it’s not just about driving around or making a cup of coffee, but basically generalizing to perhaps not any kind of problem, but within some reasonable constraints, virtually any kind of problem in the way humans can do, right? And do this very efficiently with minimal trials. So doing things zero shot or one shot, doing things like putting a man on the moon. Zero shot or one shot, few shots, perhaps, there’s very complex problems that we’ve never encountered before in a wide range of possibilities. So, I think it’s helpful to have this distinction in mind. I think it would be worth asking everyone you have on your podcast maybe “do you think current models can achieve broad generalization in Chollet’s sense?”

Michaël: I guess it depends on the timelines you ask, right? Is this something possible for humans to do, to build or if it’s possible this century or this decade. But to come back to just the Steve Wozniak definition, I don’t fully understand what’s the setting. Can you train your model, waking up in different rooms and brewing coffee in different rooms or different apartments, and then it encounters a new apartment and has to do this? Or does it have to learn how to do it without ever interacting with a flat?

Raphaël: So I think the coffee test, as formulated by Wozniak, is underspecified. So you can have various interpretations of the test and various difficulties. I think the general test in its very simple formulation would just be train your model however you want, or design your system however you want. It has to be able to just walk into a room, find a coffee maker, brew a good cup of coffee. So you’re free to train your system on a bunch of virtual apartments with virtual coffee makers, or indeed train it in the real world with robots, with real apartments and real coffee machines. I guess that does raise a question though, at which point you can just brute force the problem in an interpolative regime by just having encountered so many different situations, so many different apartments with every possible conceivable coffee machine ever made by humans, that you can just solve this as an interpolative problem.

Raphaël: But I don’t know whether that would even make sense, given the other aspect of this test, which is the complexity of having a dexterous robot that can manipulate objects seamlessly and the kind of thing that we’re still struggling with today in robotics, which is another interesting thing that, we’ve made so much progress with disembodied models and there are a lot of ideas flying around with robotics, but in some respect, the state of the art in robotics where the models from Boston Dynamics are not using deep learning, right? So there’s still this gap between what we can do with disembodied models and what we can do in the real world.

Michaël: Right. So I think there’s a question of dexterity then there’s the vision and be able to classify images and see coffee makers. I think that’s not the hardest part, but maybe in this setting, you wake up in a black room and need to find the lights first and open the doors. And I guess the true question is the same as for self driving cars is, would you be able to do it reliably? So in a thousand different apartments, what is the likelihood of it actually turning on the coffee maker and brewing coffee, right?

Raphaël: Yeah.

Michaël: I guess doing it with, I don’t know, a 50% chance, or even a 10% chance I think is not out of our range today, but doing it reliably, like a human would every day, that’s way more difficult. So you mentioned François Chollet and Yann LeCun a lot, who are French and also a bit contrarian to the AGI takes. I just want to go back to one of the things you said at the very beginning, where for Yann LeCun there is no Artificial General Intelligence, because it’s impossible to generalize completely to any input.

AI Progress Spectrum, Scaling Maximalism

Michaël: Where do you fall in those Twitter persona, like Chollet, maybe Gary Marcus and LeCun… Even our previous guest, Ethan Caballero, was more bullish on AI progress. Where do you see yourself in this spectrum?

Raphaël: Right, so if you want to think of the spectrum of positions on the progress of AI as a one dimensional spectrum, so you would have Ethan on one extreme or perhaps Nando de Freitas or people like that who think that scaling current architectures is all you need. What I call scaling maximalism. Then on the other end of the spectrum, you would have people like Gary Marcus, who is a pretty hard line skeptic about the capacities of current deep learning models. Perhaps in the deep end there would be somewhere in that region of the space as well. And somewhere in the middle, you’d have people like Yann LeCun and François Chollet and, probably as you suggested early on, the silent majority of deep learning researchers who think that current approaches have enormous potential, but also some limitations and that the way to move forward to get to something like human level intelligence in artificial systems will require new concepts and not just scaling existing architectures.

Raphaël: And yeah, it so happens that, Yann LeCun and François Chollet are French, but it’s not because they’re French that I find their positions reasonable, but I do happen to fall within that region of the space. So I would say that I align pretty well with the kind of positions that they’ve been advocating. In fact, it’s interesting because I was recently rereading the stuff that Chollet wrote about the limitations of deep learning in his book, “Deep Learning with Python”, which really goes way beyond learning how to perform deep learning tasks with Python, that he crammed a lot of theoretical stuff in that book too.

Raphaël: But there is this section at the end about the limitations of deep learning and this I think was written in 2019. So to be fair to him, this was before some of the more recent developments, but if you read some of his stuff there, it sounds remarkably similar to some of the stuff that Gary Marcus has been saying about how he’s advocating for hybrid neuro-symbolic architectures, and saying that deep learning is intrinsically limited in the range of things it can do because it’s limited to interpolative problems.

Raphaël: And so I think on that spectrum, you probably would have Gary on one extreme and people like Nando de Freitas and Ethan and others on the other extreme. And then I would say probably that François actually would be closer to Gary than Yann LeCun. Yann LeCun would be somewhere further to the right of that spectrum, if you think of this.

Michaël: Right. Yeah. I brought up the French thing because we are two French people speaking in English about French researchers and it was funny and yeah, about the Chollet versus Gary Marcus distinction. I think they’re both maybe saying something similar that deep learning has some limitations, but they’re pointing at different solutions where Chollet has a focus on generalization. And you mentioned his paper on how to measure intelligence with his new data set. And I think Gary Marcus is more focused on symbolic approaches for AI.

Raphaël: Yeah. Although, I was surprised rereading that section from François Chollet’s book that he explicitly advocates for hybrid architectures just like Gary. So he says that we’ll need what he calls “Algorithmic intelligence” and “Geometric intelligence” to work together where geometric intelligence is what you can get from deep learning and arithmetic. Sorry, algorithmic intelligence is what you get from symbolic, good old fashioned style AI systems. And he explicitly advocates, he thinks the future is going to be integrating these two things.

Raphaël: Just like you can think of AlphaGo as a hybrid architecture in that sense, because it has this Monte Carlo tree search. And so he gives that example, but yeah, it’s interesting. There is less daylight between Gary’s view and François’s view than I felt there was. Probably the difference between them, in terms of their actual views, is perhaps partially a matter of style. Gary has a more adversarial style perhaps than François, but yeah, it was interesting to me to see that. I think Yann LeCun, for example, is more all in on a kind of differentiable approach, gradient based approach to solving human level intelligence. So I think he’s less convinced that you will need to plug in some kind of symbolic module.

Michaël: So you mentioned multiple times scaling maximalism and we’ve never defined it precisely. So can you maybe define this position or give the best steelman of that position?

Raphaël: Yeah. So this is a little phrase I coined on Twitter to refer to a position that I’ve seen pop up recently from a few people and perhaps the most succinct expression of that position would be through the slogan, “Scaling is all you need.” So kind of paraphrasing the attention is all you need paper and the various X is all you need papers that have been published since.

Raphaël: And so the idea is… It’s precisely the point of this Twitter thread that I made at the time, that it’s kind of hard to pin down in specific detail what the claim is supposed to be, but intuitively the idea is that we have observed since the days of GPT-3 and the work of Jared Kaplan and others, that there are these scaling laws with respect to the progress of transformer models, such that if you scale the model size and the size of the training data, you observe proportionate improvements of the loss and you have these nice plots that you can make from there.

Raphaël: We still haven’t hit the limits of the scaling laws. So it seems that even with the gargantuan models that we have today, like PaLM with like 540 billion parameters, we still haven’t hit the limits of the scaling laws. So there is this idea that’s pretty reasonable, that scaling is very effective. And furthermore, that through scaling model size and data size, you get these discontinuous improvements that had been observed already at the time of GPT-3 and have been observed again with PaLM, that when you hit certain thresholds, in terms of model size, you suddenly have this kind of non-linear phase transition where the model suddenly gets much better at specific tasks.

Raphaël: So for example, arithmetic, or math related tasks, would be one example. So given these two observations, scaling laws and discontinuous improvements, people have started speculating about how much you can get for free as it were through raw scaling of existing architectures. And there is this view that was once a fringe view and is becoming more popular, that we can go all the way to human level intelligence just by scaling existing architectures and throwing more data and compute at them.

Raphaël: It’s interesting because that was a view that someone like Gary Marcus and… I disagree with Gary on many things, probably almost everything in some way, but I also disagree with a lot of the people that he’s been attacking, or he’s been disagreeing with. So again, I’m somewhere in the middle, but one thing I did notice, and I think he’s right about this, is that he was attacking this kind of scaling maximalist view, he’s been attacking that for a while.

Raphaël: And people used to say, “Well, that’s a complete strawman and no one is actually defending that view.” And recently it’s become very clear that some people do hold that view, right. I mean, how literally they take it is an open question. And that’s the thing about throwing around these sweeping slogans, is that you can get away with saying, “Oh, I didn’t mean it quite literally, it’s more of a general aspiration, it’s not…” and there are a lot of memes going around as well. So you can always have this kind of ironic take on all of this and say, “Well it was more of a joke. It was more of a meme.”

Raphaël: So it’s hard to ascribe very specific theoretical commitments to scaling maximalists, but that’s the general idea. That you can scale existing architectures and that’s all you need to get to something like AGI, if you want to use that term. Or as I would prefer to put it, something like human level intelligence and beyond.

Michaël: And I think when we mention scaling, there are a few things to consider. There’s the scaling of compute. So you throw more FLOPs, and/or more duration at your system, then there’s scaling the size of your models. And then there’s maybe like some architectural tricks to make it work for different data or to scale more effectively. And when we say just scaling, we sometimes dismiss those tricks or things people need to consider to make it work, gathering more data, different types of data, maybe multimodal data that would transfer. And I think you could say that scale is all you need is a meme that dismisses all those tricks, but points at, we will not need more innovation. We will not need a completely different architecture. And maybe a transformer is possibly what we need. And maybe those people only give a 50% chance of this being true. But so far we haven’t seen any evidence that would go the opposite way.

Raphaël: Right, yeah. I think that’s an important point. There are several things in what you just said. One of them is actually going back to the scaling plots, the scaling laws from Jared Kaplan and other people from OpenAI, and not misinterpreting what they’re really saying, because sometimes people throw these around and they extrapolate from them ironically. But so what these plots are showing is that you get scaling laws. I think the original plot is three plots showing scaling laws for three relationships. One is model size plotted against loss, the reduction of loss in the model. Another one was dataset size plotted against the decrease in loss. And the third one was generally just compute versus decreasing loss. So it seems like you get scaling laws for… you get improvements measured in terms of decreased loss in the predictions of the model when you increase the model size, when you increase the data, or more generally when you throw more computational power at the model, which basically is a trade off of data size and model size, to vastly simplify.
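For readers who want to see what a "scaling law" of this kind looks like numerically, the sketch below assumes the commonly used power-law form L(N) = (N_c / N)^alpha for loss as a function of model size N, generates synthetic points from it, and recovers the exponent with a straight-line fit in log-log space. The constants `TRUE_ALPHA` and `TRUE_N_C` are illustrative placeholders, not values taken from any paper, and the data is synthetic. Note that the quantity on the vertical axis is prediction loss, nothing more.

```python
import math


def power_law_loss(n_params, n_c, alpha):
    """Loss predicted by the assumed form L(N) = (N_c / N) ** alpha."""
    return (n_c / n_params) ** alpha


def fit_power_law(sizes, losses):
    """Least-squares line through (log N, log L); returns (alpha, n_c)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    # log L = -alpha * log N + alpha * log N_c
    alpha = -slope
    n_c = math.exp(intercept / alpha)
    return alpha, n_c


# Illustrative placeholder constants (assumptions, not published values).
TRUE_ALPHA = 0.08
TRUE_N_C = 1e13

sizes = [1e6, 1e7, 1e8, 1e9, 1e10]
losses = [power_law_loss(n, TRUE_N_C, TRUE_ALPHA) for n in sizes]

alpha, n_c = fit_power_law(sizes, losses)

# Under this form, every 10x increase in parameters multiplies the loss
# by the same constant factor, 10 ** -alpha.
ratio = power_law_loss(1e10, n_c, alpha) / power_law_loss(1e9, n_c, alpha)
print(alpha, n_c, ratio)
```

With an exponent this small, a tenfold increase in model size shrinks the loss by a fixed factor of roughly 0.83, which is why the curves look like straight lines on log-log plots and why extrapolating them says something about loss rather than about any direct measure of intelligence.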

Raphaël: And more recently the Chinchilla paper from DeepMind suggests that we’ve been misjudging the exact ratio that would be optimal for that, and maybe we need more data. But anyway, what’s interesting for me is the other axis, which is, sometimes you hear people talking about scaling laws as if it’s plotting model size or dataset size against intelligence, as if we had some kind of metric of intelligence, but it’s just measuring the loss of autoregressive models, or the decreasing loss, right? So as they’re predicting the next token, and that’s at best a proxy for some perhaps slightly narrower sense of intelligence. We can’t readily extrapolate from that to some kind of scaling law about something like human general intelligence. So that’s the first point I want to make. We have to be careful that these plots are specifically about improvements in the predictions of autoregressive transformers.
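The trade-off between model size and data size under a fixed compute budget can be sketched with a common rule of thumb, namely that training cost is roughly 6 FLOPs per parameter per token. That approximation, the `split_budget` helper, and the two tokens-per-parameter ratios below are assumptions chosen for illustration; the example sizes (70e9 parameters, 1.4e12 tokens) are only there to fix a budget.

```python
import math


def compute_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def split_budget(flops, tokens_per_param):
    """Solve 6 * N * D = flops with D = tokens_per_param * N; returns (N, D)."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Fix one compute budget (illustrative sizes, 20 tokens per parameter).
budget = compute_flops(70e9, 1.4e12)

# Same budget, two different ways of spending it.
data_heavy = split_budget(budget, 20.0)   # smaller model, more tokens
param_heavy = split_budget(budget, 2.0)   # larger model, fewer tokens

print(data_heavy)
print(param_heavy)
```

Both allocations cost the same compute; which one yields the lower loss is the empirical question the ratio debate is about.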

Raphaël: And then there are the other things that you mentioned, that the scaling maximalists tend to go quickly over things like changes to the architecture. And one point that I made in my thread on that was that if you take the scaling is all you need view literally, it’s literally false or absurd, because even the various recent models that have led people to lend more credence to that view, such as DALL-E 2, Gato, PaLM, Imagen, and others, all required at least some minor architectural innovation or minor tweaks to existing architectures. So they did require some changes to the architecture. They’re not just scaling transformers and seeing what happens.

Raphaël: So that’s one point. And then the other point you made is about the kind of data you feed to the model and perhaps how you format your data, what different modalities you include, how you serialize it, how you feed it to the model, all of this matters a lot. And the Gato paper, for example, shows some innovation in that respect as well. There are some innovative ways to serialize both discrete and continuous data. So button presses, joint torques, text, images, in a way that is suitable to be fed to a transformer.
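One way to picture "serializing discrete and continuous data for a transformer" is to squash each real number into a bounded range, cut that range into bins, and give each bin its own integer token id next to the text tokens. The Python sketch below does this with a mu-law style compression. The vocabulary size, bin count, compression constants, and the `serialize_step` layout are all assumptions for illustration and should not be read as the exact scheme of the Gato paper.

```python
import math

TEXT_VOCAB_SIZE = 32000  # assumed size of the text vocabulary
NUM_BINS = 1024          # assumed number of bins for continuous values
NUM_BUTTONS = 18         # assumed number of discrete button actions


def mu_law(x, mu=100.0, m=256.0):
    """Compress a real value so large magnitudes occupy less of the range."""
    magnitude = math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0)
    return math.copysign(magnitude, x)


def continuous_to_token(x):
    """Map a real value to a token id placed after the text vocabulary."""
    y = max(-1.0, min(1.0, mu_law(x)))             # clip to [-1, 1]
    bin_index = int((y + 1.0) / 2.0 * NUM_BINS)    # uniform bins over [-1, 1]
    bin_index = min(NUM_BINS - 1, bin_index)       # keep y == 1.0 in range
    return TEXT_VOCAB_SIZE + bin_index


def button_to_token(button_id):
    """Map a discrete button press to a token id after the continuous bins."""
    if not 0 <= button_id < NUM_BUTTONS:
        raise ValueError("unknown button id")
    return TEXT_VOCAB_SIZE + NUM_BINS + button_id


def serialize_step(text_tokens, joint_torques, button_id):
    """Flatten one timestep of mixed data into a single list of token ids."""
    tokens = list(text_tokens)
    tokens += [continuous_to_token(t) for t in joint_torques]
    tokens.append(button_to_token(button_id))
    return tokens


# Hypothetical timestep: two text tokens, two joint torques, one button press.
sequence = serialize_step([5, 17], [0.0, 0.5], 3)
print(sequence)
```

Once everything is a flat list of integers, the same next-token objective can be applied regardless of whether a token originally stood for a word, a torque, or a button, which is the design choice being credited as an innovation separate from scale.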

Bitter Lesson

Raphaël: So all of this is also kind of orthogonal to scaling. And so if you take the scaling is all you need view literally, it’s false. So now if you want to be more charitable and discuss perhaps a more plausible version of the view, that is something like: we can get to human level intelligence with minor tweaks to existing architectures like transformer architectures. That is still a very strong view, and is still a stronger view than something like Richard Sutton’s “Bitter Lesson”, that gets thrown around a lot as a justification of scaling maximalism.

Raphaël: And I think these two are different.

Michaël: So I think Richard Sutton said that most innovation would be in the scaling or how to scale models and the tricks from researchers trying to get a few percent increase in benchmarks from some mathematical tricks or small architectural changes will be dismissed after a few years. And what will count will be the meta models, the model that will be able to generalize from more data. And his argument was pretty close to saying scale is all researchers actually need and not innovation.

Raphaël: So I disagree with that. I think, I think what Richard Sutton was saying, and it’s a short post on his website. So there is room for interpretation, but the way in which I read him is more saying something like “We don’t need, as we once thought we did, to build in knowledge in our models, it will always be more effective in the long run to learn from raw data and through bigger models, the kind of knowledge we were tempted to build into them.” And I think what he specifically refers to there is that the kind of feature engineering that was characteristic of earlier machine learning models. So we don’t need to hand craft features for these models to learn from, we don’t need to distill human knowledge in a way that needs to be hard coded in these models as priors.

Raphaël: We can learn a lot from raw data. I think in that respect, perhaps he would be more radical, certainly than someone like Gary Marcus, but also than someone like François Chollet, because even François Chollet is adamant that we will need something like what has been observed, hypothesized, from the core knowledge literature in psychology, which is this idea that humans have innate priors with respect to things like object persistence and spatial relations and things like that. This set of core knowledge, as they call it, is the work of Elizabeth Spelke, for example, from Harvard: that non-human animals that display intelligent behaviors and humans have these innate priors. Minimal set of innate priors. And so if you’re someone like Gary Marcus for example, you might think we need to perhaps hardcode some of these priors into models for them to have the generalization capacities of humans.

Raphaël: And I think part of what Richard Sutton was suggesting is that perhaps we can have a more empiricist view, which I’m open to and rather sympathetic to, which is that you can leverage the power of computation to learn this kind of human knowledge from raw data. So that entails that all else being equal, we should favor the kind of architectural innovation that will scale more effectively to learn better from more data. It doesn’t entail that we don’t need architectural innovation at all, right? It also doesn’t entail that we need no inductive prior, no inductive bias of any kind. At the limit, that’s just absurd because of the no free lunch theorem and the induction problem. You can’t just learn anything if you don’t have some kind of inductive bias. The real question is how much inductive bias and how much prior knowledge you need. That’s also the crux of the disagreement between Gary and Yann LeCun.

Michaël: To summarize your summary of Sutton’s view, you’re saying that you advocate for less feature engineering in how we try to train our models and give more data to our models and they build the feature themselves, and they’re able to meta learn tasks instead of having humans preparing the task over. Is that a correct summarization?

Raphaël: Yeah, basically.

Michaël: And so I think a specific claim about scale is what we need is, “scale is all you need but for feature engineering or RL”. And then there will be other… “scale is all you need” is a more general claim and it applies to NLP and other subfields of AI.

Raphaël: I mean, the thing is, if you think of it that way, then it becomes almost trivial because I don’t think many people still think that you need hand crafted feature engineering in deep learning. In fact, one of the main benefits of the move to deep learning beyond the kind of shallow machine learning algorithms we had before was to learn features from data instead of having handcrafted features. Right. So at the limit, I feel like this kind of characterization of the view truly [inaudible] it.

Michaël: I would nuance it by saying, in reinforcement learning, you want to train a model to play specific games or a set of games or tasks. And the feature engineering I was referring to was mostly maybe like hyperparameter search and trying to change your algorithm to fit a certain environment so that your model would learn how to play a certain game. And if you’re able to have something that plays a bunch of different games and meta learns how to learn games, then you wouldn’t need to change your hyperparameters, or change your training algorithm. It’ll just be able to learn effectively by itself because it has seen a bunch of different things. I think that’s a better description, probably.

Raphaël: And that doesn’t entail that you can do this only through scaling, if you want to think of a broader range of tasks than just playing Atari games, right? So again, it’s agnostic. I feel like it’s related to the scaling issue because scaling is part of the equation. Definitely. I think few people would dispute that, but at most it entails that scaling is a necessary condition for more general forms of intelligence. Not that scaling is a sufficient condition.

Michaël: Right. So meta learning is in practice more useful than researcher tricks from the 2010’s or 2000’s. I think that’s what he meant.

Raphaël: Right.

What would Convince him that Scale Is All We Need

Michaël: Because we’ve talked about some thought experiments and you’re a philosopher, I wanted to ask you this question: imagine Raphaël Millière from 2025 is now convinced that scale is all you need, or maybe a weaker version. And he comes right now and he says: “Hey, Raphaël, this is what happened. Now I’m convinced.” What would be something likely that he would tell you?

Raphaël: So in other words, what could convince me that scale is all you need? Well, again, part of the issue here is that we need to have a precise definition of the claim in order for it to be falsifiable or corroborated by evidence. So really, the obvious thing that could convince someone that scaling existing architectures is all you need is if we get to something that looks like general human level intelligence, just by scaling existing architectures. So trivially, if we get there, if we have a system that can do all of the things that humans can do and pass various tests, including the coffee test, full self-driving, all forms of harder, difficult tests, then I would have to agree that scaling was all we needed all along.

Raphaël: If the architecture is not substantively different from the architectures we have today. But I guess your question perhaps would be more like what might at least lead me to revise my credence in the claim that scaling is not all you need, and perhaps think that we might get there by scaling these architectures. I think if Raphaël from the future showed me that something basically similar to current transformers can reach human level at some of the hardest benchmarks we have today, such as Chollet’s ARC challenge, all of the BIG-bench tasks, things like the Winoground benchmark that came roughly about like compositional and vision language models.

Raphaël: If you can do all of this just by having massive models with minimal changes to the architecture that would give me pause, certainly. I think that would give me pause and perhaps lead me to have more faith in emergent features of transformer models at scale. But I do think that transformer architecture is an extraordinarily powerful architecture. And I think a lot of people are short selling what transformers can do.

Raphaël: But part of the limitation of the current approach for me is the way in which we’re training models completely passively, by just having a training phase that consists in ingurgitating massive amounts of data with a very simple learning objective, just predicting the next token or the next time step or something like that, and then having a frozen model that is just tested on a bunch of downstream tasks. This is so different from the learning that we observe in biological agents that it seems very unlikely to me that this kind of approach can lead to the same degree of generalization, to what Chollet would call extreme generalization.

Unsupervised Learning, Lifelong Learning

Michaël: So I think there are different ways a model could learn. What was working better and better in the early 2010s was supervised learning. We got from good to almost perfect classification of ImageNet by just having deeper models. And so, you label a bunch of data and then you fit your model to it, and people were disagreeing like “Oh, a baby can just see a couple of examples and doesn’t need like a million labeled images to know what’s a cat or dog.”

Michaël: And then right now, I think that’s something Ilya from OpenAI said recently, it’s crazy that now unsupervised learning just works, and wasn’t working for like decades and now it just works, for no reason at all. And people are unsurprised. They’re just like, “Oh yeah, it works.” And you’re just able to train your model on billions of tokens and that’s crazier than what was happening before. Now you’re not labeling everything. The fact that we have something that can just predict the next token and have good performance on a bunch of other tasks, or the GPT-3 downstream tasks, arithmetic, and a bunch of different things. It’s crazy in itself, right?

Raphaël: Yes.

Michaël: What you’re saying is that’s not how humans learn, right? But this is still thinking about only unsupervised learning and/or supervised learning, but then there’s like reinforcement learning that is like closer to what humans do. And there’s a massive progress in RL. And people are showing that you can mix all those different learnings to get something even better. So I think when we were just saying that we give it a bunch of examples or just predicting the next token is not like what humans do. It’s kind of dismissing what we could do in the future. Or like what is actually happening in RL research.

Raphaël: Yeah. I mean, I agree with everything you said, basically. I think the recent success of… Nowadays it’s fashionable not to call it unsupervised, but self supervised instead.

Michaël: Oh, sorry.

Raphaël: Which is interesting. I think it’s a matter of semantics, but it’s still interesting that there is a sense in which calling the kind of learning that large foundation models or language models do “unsupervised” is slightly misleading, because at the end of the day, whether it’s a masked language model like BERT, or an autoregressive model like GPT-3, you’re really creating these masks, whether you’re masking the next word or a word in the sentence. So, artificially creating these learning samples, and then giving a signal in the form of the last [inaudible]. Which is, again, not quite how humans learn things like [inaudible].
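As a rough illustration of the point about artificially creating learning samples, here is a minimal sketch of how training pairs can be manufactured from raw text with no human labels. The function names and the `[MASK]` placeholder are hypothetical, and masking each position in turn is a simplification (real masked language models mask a random subset of tokens).

```python
from typing import List, Tuple

MASK = "[MASK]"  # illustrative placeholder token, not tied to any specific library

Sample = Tuple[List[str], str]  # (context the model sees, token it must predict)

def next_token_samples(tokens: List[str]) -> List[Sample]:
    """Autoregressive style: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_token_samples(tokens: List[str]) -> List[Sample]:
    """Masked style: hide one token and predict it from both sides.

    Simplification: every position is masked once, in turn.
    """
    samples = []
    for i, target in enumerate(tokens):
        context = tokens[:i] + [MASK] + tokens[i + 1:]
        samples.append((context, target))
    return samples

tokens = "the cat is on the mat".split()
for context, target in next_token_samples(tokens)[:2]:
    print("next-token:", context, "->", target)
for context, target in masked_token_samples(tokens)[:2]:
    print("masked:    ", context, "->", target)
```

In both cases the "label" is just a piece of the input that was hidden, which is why "self-supervised" is arguably the more accurate name.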

Raphaël: So maybe in other words, we haven’t yet reached the most unsupervised forms of learning that humans are capable of. So perhaps the self-supervised vs. unsupervised distinction is helpful to keep in mind in that respect. That being said, I mean, I agree that it’s tremendous what these models can do. And I agree that it’s very exciting what we’ve been seeing from the RL side of things as well. I’m not convinced yet that we can bridge these two things seamlessly in ways that we can just scale to human-like generalizations. So, there is a lot of work on decision transformers to do some kind of offline reinforcement learning.

Raphaël: What I’ve read so far suggests that these models actually struggle to generalize appropriately. In fact, there is a paper that just came out, I was reading this like yesterday I think, precisely about this, that offline reinforcement learning of the kind that we find in decision transformers is very limited by the fact that it’s learning completely passively from data, right?

Raphaël: So two things that humans and animals do that most machine learning models don’t do is, one, active learning. I mean, this is something that RL models can do. So sampling the world to, like generating your own learning samples by actively sampling the world. And secondly, lifelong learning, which is an active area of research in deep learning, but still today, like basically virtually every model is trained first and then deployed as a kind of frozen model.

Raphaël: So, I don’t think, maybe I’m wrong, but my intuition is that we are not going to… The innovation required to incorporate things like active, online learning and lifelong learning into current transformer style architectures is not going to be just a matter of like minor tweaks. It will be more substantive than that.

Raphaël: And that, it will require, probably it might, it will require moving beyond transformer architecture, having attention be just a more minor part of the whole architecture.

Michaël: What do you mean by continual learning?

Raphaël: Lifelong. Well, just the ability to update the weights of the model continuously as you deploy the model, right? So currently you train the model once and then you can fine tune it if you want to further change the weights, but you have the training events, or a number of continual, like sequential training events, if you want to think of it that way. And then at some point you stop training, you freeze the weights, and you just test the model on downstream tasks. And that’s basically the way all of deep learning works.

Raphaël: But that’s not how humans or animals work. The synaptic weights in the brain are constantly being adjusted, and we constantly learn from new experiences.
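To make the contrast concrete, here is a toy sketch, under heavy simplifying assumptions, of a "train once, then freeze" model versus one that keeps updating its weight during deployment. The one-parameter model, the deterministic data stream, and all names are hypothetical; the only point is that when the world changes after training, the frozen model cannot adapt.

```python
def sgd_step(w: float, x: float, y: float, lr: float = 0.05) -> float:
    """One gradient step on squared error for the model y_hat = w * x."""
    grad = 2.0 * (w * x - y) * x
    return w - lr * grad

def make_stream(slope: float, n: int):
    """Deterministic toy data: inputs cycle through 1.0, 2.0, 3.0."""
    return [(1.0 + (i % 3), slope * (1.0 + (i % 3))) for i in range(n)]

def train(w: float, stream, lr: float = 0.05) -> float:
    """Training phase: update the weight on every example."""
    for x, y in stream:
        w = sgd_step(w, x, y, lr)
    return w

def deploy(w: float, stream, keep_learning: bool, lr: float = 0.05):
    """Return (mean squared error, final weight) over a deployment stream."""
    total = 0.0
    for x, y in stream:
        total += (w * x - y) ** 2          # predict first...
        if keep_learning:
            w = sgd_step(w, x, y, lr)      # ...then optionally keep learning
    return total / len(stream), w

# Training phase: the world follows y = 2x.
w0 = train(0.0, make_stream(slope=2.0, n=30))

# Deployment phase: the world has changed to y = -x.
frozen_mse, frozen_w = deploy(w0, make_stream(-1.0, 30), keep_learning=False)
online_mse, online_w = deploy(w0, make_stream(-1.0, 30), keep_learning=True)

print("frozen:  weight", frozen_w, "error", frozen_mse)
print("updated: weight", online_w, "error", online_mse)
```

This sketch leaves out what makes lifelong learning hard in practice, such as forgetting earlier tasks; it only shows the difference between the two deployment regimes.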

Michaël: You would want something that is connected to a real world stream of data, and update its weights in real time after each example. And the online learning part is, instead of learning in an offline fashion, it’s trained by directly accessing a stream of realtime data.

Raphaël: Yeah, and not just accessing a stream of data, but sampling data from the world, right? Or sampling information from the world, which, becomes quite important when you start thinking about things like causal inference and this kind of thing, intervention in the world becomes quite significant to move beyond correlations in the way you learn from data.
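A small simulation can illustrate why intervening matters for moving beyond correlations. In this hypothetical setup, a hidden factor `z` drives both `x` and `y`, and `x` has no effect on `y` at all. A learner that only observes data sees a strong association; a learner that sets `x` itself sees almost none. The probabilities and names are assumptions chosen purely for illustration.

```python
import random

def sample(rng: random.Random, do_x=None):
    """Draw one (x, y) pair from a toy world with a hidden common cause z."""
    z = rng.random() < 0.5                            # hidden common cause
    if do_x is None:
        x = z if rng.random() < 0.9 else (not z)      # passively observed: x mostly follows z
    else:
        x = do_x                                      # intervention: we set x ourselves
    y = z if rng.random() < 0.9 else (not z)          # y depends on z only, never on x
    return x, y

def effect_estimate(pairs) -> float:
    """Estimate P(y | x is true) - P(y | x is false) from (x, y) pairs."""
    y_when_x = [y for x, y in pairs if x]
    y_when_not_x = [y for x, y in pairs if not x]
    return sum(y_when_x) / len(y_when_x) - sum(y_when_not_x) / len(y_when_not_x)

rng = random.Random(0)
observed = [sample(rng) for _ in range(20000)]
intervened = [sample(rng, do_x=rng.random() < 0.5) for _ in range(20000)]

observational = effect_estimate(observed)       # large: x and y share a cause
interventional = effect_estimate(intervened)    # near zero: x does not cause y
print("observational estimate: ", round(observational, 3))
print("interventional estimate:", round(interventional, 3))
```

Passive data alone cannot distinguish "x causes y" from "something else causes both", whereas acting on the world can.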

Michaël: So you’ve talked a bunch about the different benchmarks that could convince you today that “scale is all you need” is a thing. And you started with things that are kind of too difficult, which is, of course, if you achieve AGI, then you would’ve been convinced after seeing the evidence. And when you say “full self-driving,” to me, full self-driving is in some sense AI complete or AGI complete. You would need to have AGI to achieve full self-driving, but okay, I get you.

Goalpost Moving

Michaël: I think what people from the “Scale is all you need” camp, who are very bullish on AI, would tell you is that “The goal posts have moved”. And if I were to interview you maybe like five years ago, you might have, maybe not you, but like someone else might have been very impressed by where we are today.

Michaël: But right now we need like another level to be impressed. I think there was a meme, sorry, a tweet by Miles Brundage from OpenAI, who was saying, “AI researchers from 2025 be like, ‘Oh, this model can generate an entire movie, but the plot is pretty bad. And the actor is not really convincing.’”

Michaël: So, right now we have models that can create realistic images with DALL·E and Imagen. And people are criticizing a specific part about the compositionality or human faces, or a bunch of different things. But we’re never going to be fully impressed, because things are kind of moving in a continuum. So, yeah, how do you feel about this whole goal post moving in the field of AI?

Raphaël: I mean, I have a bunch of things to say about this. The first thing I would say is that it’s perfectly consistent to be impressed by what something is doing, and yet cogently discuss the remaining limitations of that thing. Right? Otherwise, we’d just be like, “Oh, okay, pack it up guys. We have DALL·E 2, that’s all we need. There is no further improvement we can obtain in AI research. This is the pinnacle of artificial intelligence.” No one is saying this. So, come on. If we want progress, the basic first step is to lucidly evaluate the limitations of current systems.

Raphaël: And I think people do, I get it, people do get sometimes annoyed with people like Gary Marcus on Twitter, because he’s a bit of the gadfly of deep learning, constantly pointing out the limitations rather than the successes. And I do partly agree with that, sometimes I think we have to be fair. And I’m personally very impressed by deep learning models. I’ve consistently been impressed ever since I got interested in deep learning, and I’m not one to minimize the successes of deep learning.

Raphaël: In fact, I routinely on Twitter, for example, when I engage, show some examples of successes, I’ve done this with compositionality, for example. So I think we ought to be impressed by what deep learning can do, and no one, I think, very few people at least, could have predicted how far we could go with current approaches. So I think that’s absolutely true.

Raphaël: That doesn’t mean we should just turn off any critical thinking, massively anthropomorphize what current models are doing, and think that DALL·E 2 is as intelligent as a human being. So then we have to stop for a second and try to carefully, meticulously, and systematically assess the current capacities of models and their remaining limitations.

Raphaël: That being said, I do agree that goalpost moving is regrettable and unhelpful. I think AI skeptics have been doing it for a long time, it’s become a joke at this point. But early AI pioneers used to say, “When we beat chess is when we have artificial intelligence systems.” And then once chess was solved, people have been saying, “Well, chess is not really the measure of intelligence.” And this has been happening over, and over, and over again.

Raphaël: I would say though, my personal view on this is that we are making progress towards more general intelligence. And I like to think of this in more relativistic or relational terms: we are increasing the generality of the generalization capacities of models. We’ve been talking about this in this very podcast a while back. But we haven’t yet reached the kind of extreme generalization that humans are capable of. And these two things are very consistent with one another, right?

Raphaël: So, we are making models that by some metrics could be considered more intelligent, or by some metrics could be considered to have a more general intelligence. And yet there are still remaining hurdles to get to the kind of general intelligence that humans have. So that’s different from moving the goalposts. That is just saying the goalpost, the ultimate goalpost remains the same, something like an intelligence, an artificial intelligence that has the same extreme generalization capacities as humans.

Raphaël: And we keep getting closer to that, but we are not there yet. And the… Well, a couple of other things I wanted to say about this. Something that people sometimes fail to observe is that there is some goal post moving in the other camp, from the other extreme side, right? Which is, if you’re a scaling maximalist, I suspect we’re going to see more and more of that, where you say “Scaling is all you need,” and then there’s a new model that comes out that achieves some breakthrough through some modification of the architecture.

Raphaël: And the scaling maximalists are going to say, “Oh, but that’s evidence that scaling is all you need. And yes, there are some modifications of the architecture, but that doesn’t really count, right? It’s basically the same architecture.”

Raphaël: And so, then a question is, how much architectural innovation will scaling maximalists allow before they acknowledge that this is not just a matter of scaling, but also a matter of changing the architecture or the format of the data, or various other things aside from scaling, right? So, I suspect we’re going to see a lot of goal post moving from that side too. And I think we should avoid goal post moving in both directions.

Raphaël: And the final thing I wanted to say is that this whole discussion reminds me of, we touched on this earlier, but I do worry a little bit about polarization in the discussion of these issues, especially on social media. And I had this exchange about this recently with Chris Olah, who seems to worry about similar things.

Raphaël: Again, the vocal minorities that are more liable to goal post moving are the ones that perhaps occupy the most space on social media. But I think there are tons of people who try to do careful evaluations of current models, acknowledging strengths, but also trying to evaluate remaining limitations. And this is the kind of work we should be doing.

Raphaël: It’s not very helpful to either loudly say that models are not intelligent yet, or that currently existing architecture is basically getting us in the near term to human intelligence. I think if we take a step back for a second and just carefully evaluate the strengths and limitations of the models, we can make more specific falsifiable claims that are less susceptible to be affected by goal post moving, if that makes sense.

What Researchers Should be Doing, Nuclear Risk, Climate Change

Michaël: Hopefully we can avoid goal post moving in both camps. You started using the word “should,” which is a very specific word for a philosopher, where I think… That’s where we disagree because… And that reflects what we say “is” about the world. Because I think people concerned with existential risk from AI are looking at AI progress and thinking about what could happen if something is true or not. So if scale is indeed all you need, then maybe there’s a 10% chance that in 5 or 10 years, we would get something close to AGI.

Michaël: And maybe then we can think of, like, what would be the impact in the world, and if we come back to the ought question, should we prepare for those risks right now, considering there’s a 10% chance of something dramatic happening in 10 years. And maybe what the AI skeptics are saying is… They’re probably more in the camp of, like, AI is so difficult that we’re trying to make progress. And all those AI researchers are in the same camp of trying to make progress on this hard task.

Michaël: And when they produce a paper, they’re trying to make progress. And what Gary or other skeptics are doing is maybe bringing more nuance and saying, like, “Oh, cool, your results are pretty good, but you’re not really generalizing to this.” But the goal is always like, “Make progress in AI.” And when you see this, you’re like, “Okay, cool,” so you’re being useful because you’re pointing out limitations. And then we can eventually make more useful progress.

Michaël: The other, I guess, I think there’s like another camp, who are maybe like the singularists, or the AGI cult, I will say, who think that it’s going to be good, or there’s a high chance of it being very good for the world and lower risk, or maybe they dismiss the risk. And this camp is very bullish on AI, and we’ll see everything as like, “Oh yeah, we’re going to get closer to something very good for humanity.”

Michaël: And yeah, I think Gary is maybe skeptical of this view, or thinks the thing is very far away. So I think there’s a bunch of different criteria and people talking about it. And at least on my part, when I think about current progress, I think about if this is true and scale is all you need, what would it mean for humanity? And should we prepare for it or not?

Michaël: And I’m not doing motivated reasoning of it. I don’t want the thing to happen. I don’t want to reach AGI too soon, for me. And I guess another thing you were saying is, about social media and like people talking past each other, and a silent majority of people not saying anything. And you were saying that your takes were not very controversial, and that there was a silent majority of people thinking the same things as you do or something close.

Michaël: And there’s also something close related to existential risk from AI, where if you’re like an existing AI researcher at some prestigious lab or university, you cannot actually talk about existential risk, because it’s somehow out of the Overton window for AI researchers. And to me, there’s also a silent part of the researchers that are concerned about AGI or concerned about existential risk, but cannot really talk about it out loud, because otherwise they would lose their job.

Michaël: And it’s starting to open more and more, and people are talking about it more and more because they’re more impressed. Right? But I think there’s a silent majority in both camps.

Raphaël: Right. I mean, I think if you’re really concerned about existential risk from the development of AI, and you believe in something like scaling maximalism, then you probably shouldn’t be working for one of the labs that’s trying to bring about human-level or superhuman-level artificial intelligence, right?

Raphaël: So you have to be consistent and put your actions where your mouth is. So if really there is this silent majority of AI researchers in various labs around the world who are kept up at night by the existential risk, and also think that we can get to that level of risk by scaling existing architectures, and yet in their day jobs are working on scaling existing architectures more efficiently, that seems to be a bit of a disconnect. So-

Michaël: It’s not always that simple, right? So, we both agree, I guess, that there is some risk of catastrophe from climate change in the decades that are coming, right? But we’re not actively working on those because we don’t have omnipotence over the outcomes. And so, I think some people see AGI like climate change, probably as something that could impact their future in different degrees. And they might not have a huge impact on this.

Michaël: And they’re just like seeing this from outside. And from an egoistic point of view, they’re just like, “Oh, why am I going to risk my job and my reputation, it’s only going to change by,” I don’t know, “0.001% the outcome.” And also those teams, so there are a bunch of different labs working on existential risk. So those people that are not expressing themselves on social media, they might also have jobs that are not at Baidu or Microsoft or those things, but at actual labs working on existential risk.

Michaël: And those labs working on AGI, so maybe like OpenAI or DeepMind, also have safety teams or alignment teams. So it’s not like a clear separation between capabilities and safety, it is much more nuanced than that.

Raphaël: That’s true, but I don’t think your analogy works quite in the way you want it to work, because it’s not like… There’s a difference between having a regular job that is not worrying about the environmental collapse, and having a regular job, and not quitting everything you’re doing to work on, for example, renewable energies.

Raphaël: And on the other hand, having these worries about impending climate collapse and working for Exxon Mobil. Now, if you do think that existential risks are real, and that we could get there by scaling existing architectures and you explicitly work in your day job on scaling existing architectures, that seems to me like even if you are trying to work on things like alignment, if you really do believe that…

Raphaël: And I’m not saying that I personally do, because again, I’m not a scaling maximalist and I also have some qualms about how the debate on existential risk is framed… But if you do believe these two things, then it seems like instead of working on alignment, a better thing to do would be to try to work to militate against scaling these models at all. Right? Instead of actively supporting scaling efforts, that’s just an observation.

Michaël: You’re pointing out activism. So there could be more effective ways or things that people are not doing, which is strong activism against scaling or AGI. And I think those things are being brought up more, but it’s not the best way to bring this issue to AI researchers. I think if you want to actually push the field of AI Alignment forward, you want to bring good research to the world and show that those people are not crackpots from the 2000s thinking about something very far fetched, but that they are actually knowledgeable researchers trying to make progress in things.

Michaël: So activism is being brought up, but is not the main concern at the time. And so to just answer your concern about those people working on scaling, you know that there’s, again, a continuum between training big models, training large models, and being the one, being Jared Kaplan, publishing like the scaling laws paper at OpenAI.

Michaël: So there are various ways to work on this, and working on AI makes you a better researcher that could possibly align the things. So like instrumentally, even if you do some research that pushes the AI field forward for 10 years, and after 10 years, you do some, like, five papers on the alignment, like the fraction of the research you do in the first 10 years, it’s so minimal compared to the other millions of researchers doing things, that your impact is very minimal.

Michaël: But at the end, there are so few people doing alignment, that like being an expert in this, in AI, will make a significant impact in the overall research. Right? So if there are like 100 researchers doing alignment, and you’re one of them, you’re having a huge impact, and learning things early on requires you to interact with AI. And the way to go is, work at Google, do a PhD at Stanford, or other things, because otherwise you will just be someone writing about far-fetched topics without any AI understanding.

Raphaël: Maybe. But again, it’s like, and I’m not, again, I’m not saying that I, that I personally believe these things, but if you do have a very real concern for existential risk and are a scaling maximalist, and you have that kind of outlook, it’s a little bit like if you were, like in the 1940s working on nuclear weapons and saying, “Well, it’s okay, because at the end of the road, once I’ve contributed to the effort to develop nuclear weapons, I will write a treaty on international nuclear policy and how to avoid nuclear war,” right?

Raphaël: So maybe you will make a tremendous contribution that way to avoid nuclear war. But there is still a valid question about why you’re working to develop that very thing in the first place if you really are worried about existential risk. Basically, it’s kind of a cost-benefit analysis. If you really think the outcome that’s even remotely plausible is that the whole of humanity gets wiped out, why would you actively contribute to scaling efforts if you think that’s one road to that kind of outcome?

Raphaël: But, I mean, I personally am not that worried about existential risk comparatively to some other issues. But I also think it’s not either/or, there’s plenty of space to worry about different things concurrently. So I’m not saying that I subscribe to that view, but I’m just trying to place myself in the shoes of someone who does.

Michaël: Right. And I think, for an example, on nuclear war, or nuclear risk, for the Manhattan Project, there were probably hundreds of researchers in the US working on this, maybe the same amount in Germany. And they didn’t really know that the thing would succeed. They were just trying, and it was very unclear for them how long it would take.

Michaël: And for the people who actually did nuclear fission for the first time, I think they were quite surprised. And yeah, for AI, I think that the scale is just so different that instead of being like one in a hundred, you might be like one in a million. So your impact is much less, right? As an individual.

Raphaël: Yeah. I mean, you can say something about recycling and things like that, of course, but your impact as an individual is very small on climate change. Taking planes, recycling, having children, whatever you want to talk about, your impact as an individual is a drop in the bucket. It doesn’t mean that it doesn’t have implications about what you ought to do if you do have certain beliefs, right?

Raphaël: So again, I’m just thinking it’s kind of a controversial argument, right? But it, or at least like making some assumptions about existential risk being a real concern, and it being precipitated by scaling alone. But if you talk about nuclear weapons, a lot of scientists also refused to contribute to that effort. And so if that’s a genuine concern, that’s something that’s worth bringing to the table.

Compositionality, Structured Representations

Michaël: Yeah. To just go back to something you said before about compositionality and the things you’ve been reacting to more recently on DALL·E and Imagen, could you maybe define what’s compositionality for the other firms or maybe like in the context of image generation?

Raphaël: Yeah. So that’s something I’m currently very interested in, in my own work. So compositionality was initially defined as a property of language. You can define it as a property of formal languages, although it’s more controversial for natural language. But the idea is that in a compositional system, the meaning of a complex expression is fully determined by the meaning of its constituent parts, plus its syntactic structure, so the way in which these parts are composed with one another.

Raphaël: So, you can think of a complex expression in first-order logic, for example, or you can think indeed of an example with a natural language sentence, like, “The cat is on the mat.” The meaning of that sentence is determined by the meaning of its constituent parts, so the meaning of “cat,” the meaning of “mat,” the meaning of the verb “is,” and how they’re combined syntactically, with an agent and a patient, or at least a subject, a verb and a complement. The meaning of the whole sentence is determined through that syntactic structure and the meaning of the parts.
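The idea that meaning is determined by the parts plus the way they are combined can be sketched with a toy evaluator. Everything here is an illustrative assumption: the tiny "world" of facts, the lexicon, and the nested-tuple encoding of structure are hypothetical stand-ins, not a model of natural language.

```python
from typing import Tuple, Union

Expr = Union[str, Tuple]  # a word, or (head, argument, ...) for a complex expression

WORLD = {("on", "cat", "mat")}  # toy model: the only fact that holds

LEXICON = {
    "cat": "cat",
    "mat": "mat",
    "on": lambda subj, obj: ("on", subj, obj) in WORLD,
    "not": lambda p: not p,
    "and": lambda p, q: p and q,
}

def meaning(expr: Expr):
    """Meaning of an expression = meanings of its parts + how they are combined."""
    if isinstance(expr, str):
        return LEXICON[expr]              # lexical semantics: look the word up
    head, *args = expr                    # structure: a head applied to its arguments
    return LEXICON[head](*[meaning(arg) for arg in args])

cat_on_mat = ("on", "cat", "mat")
mat_on_cat = ("on", "mat", "cat")         # same parts, different structure

print(meaning(cat_on_mat))
print(meaning(mat_on_cat))
print(meaning(("and", cat_on_mat, ("not", mat_on_cat))))
```

The two sentences use exactly the same words, yet the evaluator assigns them different meanings, because the structure differs. That sensitivity to structure is what a compositional system requires.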

Raphaël: So that’s compositionality in the linguistic domain. Now, there are reasons to think that natural language might not be fully or strictly compositional, in the sense that the meaning of sentences in natural language can be determined by some factors other than the meaning of constituent parts and syntactic structures. So things like the context of occurrence, various pragmatic considerations and so on.

Raphaël: But a weaker claim would be that the meaning of at least some sentences, and the meaning of most sentences, is at least partially determined by the meaning of constituent parts and syntactic structure. So if you want to understand language, if you want to understand the meaning of sentences properly, you need to have a grasp on the meaning of words, lexical semantics, and you need to have a grasp on how these meanings combine together to form complex expressions, which is what we call compositional semantics.

Raphaël: That’s for language. And then you can, by analogy, talk about compositionality in non-linguistic representational systems, including in the visual domain. So something that would be rather close to the notion of compositionality that’s used for language would be the compositionality of symbols for things like road signs. So you can think of individual elements like arrows and various other individual elements of road signs that can be recombined with each other, such that the meaning of the combinations in road signs is determined by how these symbols are combined together and the meaning of individual symbols. Now, when it comes to natural images, like a picture of a cat, things are a little bit more fuzzy, and the way in which you can think of compositional semantics there is a little bit more abstracted away from the linguistic context, because there the notion of structure in an image is different from syntactic structure in language.

Raphaël: And also the notion of meaning is harder to apply literally. So what’s the meaning of a pixel or what’s the meaning of an edge? That gets a little bit more complicated, but still there’s a broad notion of composition there. And you can think of understanding what an image means or what a video means in compositional terms by segmenting the image into constituent parts and understanding how these parts are composed and interacting with each other. So when people talk about compositionality when it comes to AI, I think what they’re mostly concerned with is a parsing problem. How can we build systems that can parse language or images or videos or whatever other domain that might be compositional in a way that is suitably sensitive to the compositional structure of the input, and produce outputs that are appropriately sensitive to the compositional structure of the input?

Raphaël: So if you think of this in a text-to-text domain like text generation for an auto-regressive model, you can think of an example where you have some sequence of text like “man bites dog” and then a question, “Who needs urgent care?” And you prompt the model to give an answer. If the model just uses the distributional statistics of words, it might be prone to responding “the man,” because usually what gets bitten in that context, when you have a man and a dog involved (this is an old example from Steve Pinker), is the man, right? However, if the model really grasps the compositional semantics of the input, then it will say “the dog.” If you think of this in text-to-image generation, then you can think of all the examples that I’ve been discussing on Twitter, and others have as well, like Gary Marcus.
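The contrast in the “man bites dog” example can be made concrete with a toy sketch. This is only an illustration under simplifying assumptions: `bag_of_words`, `parse_svo`, and `who_needs_urgent_care` are hypothetical helpers written for this example, and the “parser” only handles three-word subject-verb-object sentences. The point is that an order-insensitive representation cannot tell the two sentences apart, while a representation that keeps track of who is the agent and who is the patient can.

```python
from collections import Counter


def bag_of_words(sentence):
    """Order-insensitive view: only which words occur, and how often."""
    return Counter(sentence.lower().split())


def parse_svo(sentence):
    """Toy structure-sensitive parse for three-word 'subject verb object' sentences."""
    subject, verb, obj = sentence.lower().split()
    return {"agent": subject, "verb": verb, "patient": obj}


def who_needs_urgent_care(sentence):
    """Answer from structure: the patient of 'bites' is the one who got bitten."""
    parsed = parse_svo(sentence)
    return parsed["patient"] if parsed["verb"] == "bites" else None


unusual, usual = "man bites dog", "dog bites man"
# Same words, so the bag-of-words views are identical.
print(bag_of_words(unusual) == bag_of_words(usual))
# The structured view gives different answers for the two sentences.
print(who_needs_urgent_care(unusual))
print(who_needs_urgent_care(usual))
```

A system relying only on something like the first representation has to fall back on what is statistically typical, which is the failure mode described above.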

Raphaël: So a [inaudible 02:03:02] case would be generating an image for the caption “a horse riding on an astronaut.” That was the example that Gary Marcus talked about, where a human would be able to draw that because a human understands the compositional semantics of that input, and current models are struggling, also because of distributional statistics. And in the image-to-text example, that would be, for example, stuff that we’ve been seeing with Flamingo from DeepMind, where the model looks at an image that might represent something very unusual and is unable to correctly describe the image in a way that’s aligned with the composition of the image. So that’s the parsing problem that I think people are mostly concerned with when it comes to compositionality and AI. I think there is a further question, which is about the kind of representations that the systems themselves have, and whether they have structured representations that are compositional, such that the representation of the complex expression is itself made of the combination of the representations of its constituent parts.

Raphaël: So Jerry Fodor, who talked a lot about compositionality, thought that in order for humans to parse compositional semantics in language, we need to have a language of thought. So we need mental representations that themselves have compositional structure, and you might make a similar argument for language models or AI models generally. If you’re convinced by this kind of argument, you’d say something like: if you want a model to solve the parsing problem, to really be able to parse compositionally structured inputs in a reasonable way, you want this model to have representations that are recombinable, where simple representations can enter as constituents into complex representations with some kind of syntactic structure, essentially implementing something like a classical symbolic architecture. And so I think Gary has some sympathy for that kind of view, at least in the realm of hybrid models where you need some kind of symbolic component that has this kind of feature. And some people think that you might be able to get away with a fully differentiable architecture that doesn’t literally have this kind of property.

Michaël: Yeah. So just to be more practical, the example you gave was of a horse riding an astronaut; that’s something Dall-E or Imagen are not capable of doing. What were other examples of the limitations in practice? What are the images that they were not able to represent?

Conceptual Blending, Complex Syntactic Structure, Variable Binding

Raphaël: Yeah. So the thing about compositionality that I think is often underappreciated in these discussions is that a complex expression can have more or less compositional structure or syntactic structure, right? So you have some examples of combinations or compositional complex expressions where you have a minimal amount of syntactic structure. So for example, conceptual combination, where you just put two concepts together. If I talk to you about the concept of, to take an example that’s popular with image generation, an avocado chair, right, there is minimal syntactic structure there. You’re just literally putting together two nouns and combining them as a single concept. So that doesn’t require you to abstract a lot about the syntactic structure of the sentence. There is no verb. It is just some kind of conceptual blending, and current text-to-image generation models are very good at that. The avocado chair, for example, rose to prominence with Dall-E (the first one).

Raphaël: And I just posted some examples on Twitter of hybrid animals, testing Dall-E 2 on combining two very different animals together, like a hippopotamus octopus or things like that. And it’s very good at combining these concepts together, and doing so even in a way that demonstrates some minimal world knowledge, in the sense that it combines the concepts not just by haphazardly throwing together features of a hippopotamus and features of an octopus, or features of a chair and features of an avocado, but combining them in a plausible way that’s consistent with how a chair looks and would behave in the real world, or things like that. So those are examples that are mostly semantic composition, because it’s mostly about the semantic content of each concept combined together with minimal syntactic structure. The realm in which current text-to-image generation models seem to struggle more right now is with respect to examples of compositionality that have a more sophisticated syntactic structure. So one good example from the Dall-E 2 paper is prompting a model to generate a red cube on top of a blue cube.

Raphaël: What that example introduces compared to the conceptual blending examples I’ve given is what people call, in psychology, variable binding. You need to bind the property of being blue to a cube that’s on top and the property of being red to a cube that’s…I think I got it the other way around. So red to the cube that’s on top and blue to the cube that’s at the bottom, and a model like Dall-E 2 is not well suited for that kind of thing. And we could talk about this, but that’s also an artifact of its architecture, because it leverages the text encodings of CLIP, which is trained by contrastive learning. And so when you’re training CLIP, it’s only trying to maximize the distance between text-image pairs that are not matching and minimize the distance between text-image pairs that are matching, where the text is the right caption for the image.

Raphaël: And so through that contrastive learning procedure, it’s only keeping information about the text that is useful for this kind of task. So a lot of this can be done without closely modeling the syntactic structure of the prompts or the captions, unless we adversarially designed a new data set for CLIP that would include a lot of unusual compositional examples like a horse riding an astronaut and various examples of blue cubes and red cubes on top of one another. Given the kind of training data that CLIP has, especially for Dall-E 2, stuck [inaudible 02:10:20] and stuff like that, you don’t really need to represent rich compositional information like that to train CLIP, and hence the limitations of Dall-E 2. Imagen does better at this, because it uses a frozen language model, T5-XL, I think, and we know that language models do capture rich compositional information.
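The contrastive objective described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than CLIP’s actual training code: the function names are made up for this example, the two-dimensional embeddings are toy values, and the temperature of 0.07 is just an illustrative setting. The loss is low when each caption is most similar to its own image and high when captions line up with the wrong images, and nothing in it directly rewards encoding the syntactic structure of the caption.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy(logits, target):
    """Negative log-probability of the target index under a softmax over logits."""
    top = max(logits)
    log_partition = top + math.log(sum(math.exp(value - top) for value in logits))
    return log_partition - logits[target]


def contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric contrastive loss: the i-th text should match the i-th image."""
    texts = [normalize(t) for t in text_embs]
    images = [normalize(i) for i in image_embs]
    n = len(texts)
    # sim[r][c] is the scaled similarity between text r and image c.
    sim = [[dot(t, i) / temperature for i in images] for t in texts]
    text_to_image = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    image_to_text = sum(
        cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (text_to_image + image_to_text) / 2


captions = [[1.0, 0.0], [0.0, 1.0]]
matching_images = [[0.9, 0.1], [0.1, 0.9]]
swapped_images = [[0.1, 0.9], [0.9, 0.1]]
# The loss is much lower when captions are paired with the right images.
print(contrastive_loss(captions, matching_images) < contrastive_loss(captions, swapped_images))
```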

Michaël: Do you know how it uses the language model?

Raphaël: I haven’t done a deep dive into the Imagen paper yet, but it’s using a frozen T5 model to encode the prompts. And then it has some kind of other component that translates these prompts into image embeddings, and then does some gradual upscaling. So there is some kind of multimodal diffusion model that takes the T5 embedding and is trained to translate that into image embedding space. I don’t exactly remember how it does that, but I think the key part here is that the initial text embedding is not the result of contrastive learning, unlike the CLIP model that’s used for Dall-E.

Michaël: Gotcha. Yeah. I agree that…yeah. From the images you have online, you don’t have a bunch of a red cube on top of a blue on top of a green one, and it’s easy to find counter examples that are very far from the training distribution.

Raphaël’s Experience with DALLE-2

Michaël: I’m curious about your experience with Dall-E because you’ve been talking about Dall-E before you had access. And I think in the recent weeks you’ve gained the API access. So have you updated on how good it is or AI progress in general, just from playing with it and being able to see the results from octopus, I don’t know how you call it.

Raphaël: Yeah. I mean, to be honest, I think I had a fairly good idea of what Dall-E could and couldn’t do before I got access to it. And there’s nothing that I generated that kind of made me massively update that prior I had. So again, it’s very good at simple conceptual combination. It can also do fairly well at some simple forms of more syntactically structured composition. So, I don’t know, one prompt that I tested it on that was quite funny is an angry Parisian holding a baguette. So “an angry Parisian holding a baguette, digital art.” Basically every output is spot on. So it’s like a picture of an angry man with a beret holding a baguette, right? So with this kind of simple compositional structure, it’s doing really well. That’s already very impressive in my book.

Raphaël: So I was pushing back initially against some of the claims from Gary Marcus precisely on that. Around the time of the whole “deep learning is hitting a wall” stuff, he was emphasizing that deep learning, as he would put it, fails at compositionality. I think first of all, that’s a vague claim, because there are various things that could mean depending on how you understand compositionality, and what I pointed out in my reaction to that is that really the claim that is actually warranted by the evidence is that there are failure cases with current deep learning models, with all current deep learning models, at parsing compositionally structured inputs. So there are cases in which they fail. That’s true, especially the very convoluted examples that Gary has been testing Dall-E 2 on, like a blue cube on top of a red pyramid next to a green triangle or whatever. When you get to a certain level of complexity, even humans struggle.

Raphaël: If I asked you to draw that, and I didn’t repeat the prompt, I just gave it to you once, you probably would make mistakes. The difference is that we humans can go back and look at the prompt and break it down into subcomponents. And that’s actually something I’m very curious about. I think a low-hanging fruit for research on these models would be to do something a little similar to chain-of-thought prompting, but with text-to-image models instead of just language models. So with chain-of-thought prompting, or the Scratchpads paper for language models, you see that you can get remarkable improvements in in-context learning when, in your few-shot examples, you give examples of breaking down the problem into sub-steps.

Michaël: Let’s think about this problem step by step.

Raphaël: Yeah, yeah, exactly. And so, well, actually the “Let’s think about this step by step” stuff was slightly debunked, in my view, by a blog post that just came out. Who did that? I think someone from MIT. I could send you the link, but it was someone who tried a whole bunch of different tricks for prompt engineering and found that, at least with arithmetic, the only really efficient one is to do careful chain-of-thought prompting, where you really break down each step of the problem. Whereas just appending “let’s think step by step” wasn’t really improving the accuracy. So there are perhaps some replication concerns with “Let’s think step by step.”

Raphaël: But if you do spell out all the different steps in your examples of the solution, then the model will do better. And I do think that perhaps in the near future, someone might be able to do this with text to image generation where you break down the prompt into first let’s draw a blue triangle, then let’s add a green cube on top of the blue triangle and so on.

Raphaël: And maybe if you can do it this way, you can get around some of the limitations of current models.
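The step-by-step idea floated here can be sketched as a simple prompt decomposition. This is purely hypothetical: `decompose_scene` is a made-up helper, the scene description is written by hand rather than parsed from a prompt, and no existing text-to-image model or API is being described. Each resulting instruction would have to be passed to some image generation or editing step, which is not shown.

```python
def decompose_scene(first_object, additions):
    """Turn a scene description into incremental drawing instructions.

    first_object: noun phrase for the first thing to draw.
    additions: list of (object, relation, reference) triples, added in order.
    """
    instructions = [f"First, draw {first_object}."]
    for obj, relation, reference in additions:
        instructions.append(f"Then add {obj} {relation} {reference}.")
    return instructions


steps = decompose_scene(
    "a blue triangle",
    [("a green cube", "on top of", "the blue triangle")],
)
for step in steps:
    print(step)
```

The hard part, which this sketch sidesteps, is producing the list of triples from a free-form prompt in the first place.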

Michaël: Isn’t that already a feature of the Dall-E API? At least on the blog post, they have something where they have a flamingo that you can add, remove, or move to the right or left.

Raphaël: Yeah. So you can do inpainting and you can gradually iterate, but that’s not something that’s done automatically. What I’m thinking about would be a model that learns to do this, similarly to chain-of-thought prompting. So there is a model that just came out a few days ago that I tweeted about that does something a little bit different, but along the same broad lines. So it’s breaking down the compositional prompts of the diffusion models into distinct prompts, and then has this compositional diffusion model that has compositional operators like “and.” For example, if you want a blue cube and a red cube, it will first generate embeddings for a blue cube and for a red cube. And then it will use a compositional operator to combine these two embeddings together.

Raphaël: So it’s kind of like hard coding compositional operations into the architecture. And my intuition is that this is not the right solution for the long term, because, again, the bitter lesson, blah, blah, blah, you don’t want to hard code too much in the architecture of your model. And I think you can learn that stuff with the right architecture. And we see that in language models, for example: you don’t need to hard code any syntactic structure, any knowledge of grammar, in language models. So I think you don’t need to do it either for vision-language models, but in the short term, it seems to be working better than Dall-E 2, for example, if you do it this way.
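To give a flavor of what a hard-coded “and” operator could look like, here is a toy sketch. It is an assumption-laden illustration, not the architecture of the model mentioned above: `compose_and` is a hypothetical function, the short vectors stand in for whatever the model actually combines (embeddings or per-concept denoising predictions), and the rule used, starting from an unconditional vector and adding each concept’s weighted offset from it, is borrowed from the general style of guidance in diffusion models.

```python
def compose_and(unconditional, conditionals, weights=None):
    """Combine per-concept vectors with a fixed 'and' rule.

    Start from the unconditional vector and add, for every concept,
    the weighted difference between that concept's vector and the
    unconditional one.
    """
    if weights is None:
        weights = [1.0] * len(conditionals)
    combined = list(unconditional)
    for concept, weight in zip(conditionals, weights):
        for k, (c, u) in enumerate(zip(concept, unconditional)):
            combined[k] += weight * (c - u)
    return combined


unconditional = [0.0, 0.0]
blue_cube = [1.0, 0.0]  # toy stand-in for "a blue cube"
red_cube = [0.0, 1.0]   # toy stand-in for "a red cube"
# Both concepts contribute to the combined vector.
print(compose_and(unconditional, [blue_cube, red_cube]))
```

The operator itself is fixed by hand rather than learned, which is exactly the kind of built-in structure that the bitter-lesson worry is about.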

Michaël: Right, so you split your sentence with the “and” and then you combine those embeddings to generate the image. I think, yeah, as you said, the general solution is probably as difficult as solving the understanding of language, because you would need to see in general how, in a sentence, the different objects relate to each other. And so to split it effectively, it would require a different understanding.

The Future of Image Generation

Michaël: I’m curious, what do you think would be kind of the new innovation? So imagine we’re in 2024 or even 2023 and Gary Marcus is complaining about something on Twitter. Because for me, Dall-E was not very high resolution, the first one, and then we got Dall-E 2 that couldn’t generate text or, you know, faces, or maybe that’s something from the API, not really an AI problem, and then Imagen came along and did something much more photorealistic that could generate text.

Michaël: And of course there are some problems you mentioned, but do you think in 2023 we would just work on those compositionality problems one by one, and we would get three objects, blue on top of red on top of green, or would it be something very different? Yeah, I guess there are some long-tail problems in fully solving the problem of generating images, but I don’t see what it would look like. Would it be just Imagen but a little bit different, or something completely different?

Raphaël: So I think my intuition is that yeah, these models will keep getting better and better at this kind of compositional task. And I think it’s going to happen probably gradually, just like language models have been getting better and better at arithmetic, first doing two-digit operations and then three-digit, and with PaLM perhaps more than that, or with the right chain-of-thought prompting more than that. But it still hits a ceiling and you get diminishing returns, and that will remain the case as long as we can’t find a way to basically approximate some form of symbolic-like reasoning in these models, with things like variable binding. So I’m very interested in current efforts to augment transformers with things like episodic memory, where you can store things that start looking like variables and do some operations.

Raphaël: And then have read and write operations. To some extent the work that’s been done by the team at Anthropic led by Chris Olah, and with people like [inaudible 02:21:37], which I think is really fantastic, is already shedding light on how transformers (and they’re just vanilla transformers; in fact, they’re using toy models without MLP layers, so just attention-only transformers) can have some kind of implicit memory where they can store and retrieve information and do read and write operations in subspaces of the model. But I think to move beyond the gradual improvement that we’ve seen for tasks such as mathematical reasoning and so on from language models, to something that can perform these operations more reliably and in a way that generalizes better, for arbitrary digits, for example, we need something that’s probably some form of modification of the architecture that enables more robust forms of variable binding and manipulation in a fully differentiable architecture.

Raphaël: Now, if I knew exactly what form that would take, then I would be founding the next startup that gets $600 million in Series B, or maybe I would just open source it. I don’t know, but in any case I would be famous. So I don’t know exactly what form that would take. I know there is a lot of exciting work on somehow augmenting transformers with memory. There’s some stuff from the Schmidhuber lab recently on fast weight transformers. That looks exciting to me, but I haven’t done a deep dive yet. So I’m expecting a lot of research on that stuff in the coming year. And maybe then we’ll get a discontinuous improvement of text-to-image models too, where instead of gradually being able to do three objects, a red cube on top of a blue cube, and then four objects, and gradually like that, all of a sudden we would get to arbitrary compositions. I’m not excluding it.

Conclusion

Michaël: As you said, if you knew what the future would look like, you would be founding a Series B startup in Silicon Valley, not talking on a podcast. Yeah. I think this is an amazing conclusion because it opens a window for what is going to happen next. And, yeah. Thanks for being on the podcast. I hope people will read all your tweets, all the threads on compositionality, Dall-E, GPT-3, because I personally learned a lot from them. Do you want to give a quick shout-out to your Twitter account or a website or something?

Raphaël: Sure. You can follow me at @Raphaelmilliere on Twitter. That’s Raphael with PH, the French way. And my last name Milliere, M, I, L, L, I, E, R, E. You can follow my publications on raphaelmilliere.com. And I just want to quickly mention this event that I’m organizing with Gary Marcus at the end of the month, because it might interest some people who enjoyed the conversation on compositionality.

Raphaël: So basically I’ve been disagreeing with Gary on Twitter about how extensive the limitations of current models are with respect to compositionality. And there’s something that I really like, a model of collaboration that emerged initially from economics, but that’s been applied to other fields in science, called adversarial collaboration, which involves collaborating with people you disagree with to try to have productive disagreements and settle things with falsifiable predictions and things like that. So in this spirit of adversarial collaboration, and since I think Twitter amplifies disagreements rather than allowing reasonable, productive discussions, I suggested to Gary that we organize a workshop together, inviting a bunch of experts in compositionality and AI to try to work these questions out together. So he was enthusiastic about this, and we organized this event online at the end of the month, which is free to attend. You can register at compositionalintelligence.github.io.

Raphaël: And yeah, if you’re interested in that stuff, please do join the workshop. It should be fun. And thanks for having me on the podcast. That was a blast.

Michaël: Yeah, sure. I will definitely join. I will add a link below. I can’t wait to see you and Gary disagree on things and make predictions and yeah. See you around.