2022-08-22

Ethan Perez on inverse scaling

Ethan Perez is a research scientist at Anthropic, working on large language models. His goal is to reduce the risk of catastrophic outcomes from advanced machine learning systems. Ethan did his PhD At NYU, funded by Open Philanthropy, trying to find undesirable behavior in large language models. On top of that, he worked at DeepMind, Facebook AI Research, Google, and Mila. He is the second Ethan working with large language models coming on the show but, in this episode, we discuss why alignment is actually what you need, not scale. We discuss three projects he has been pursuing before joining Anthropic, namely the Inverse Scaling Prize, Red Teaming, and Trainining Language Models With Language Feedback.

^{_{(You can click on any sub-topic of your liking in the outline below and then come back to the outline by clicking on the green arrow}} ⬆^₎

The Inverse Scaling Prize
Training Language Models with Language Feedback
Red Teaming
Conclusion
- Learning AI Alignment through the Inverse Scaling Prize
- Final thoughts on AI Alignment

The Inverse Scaling Prize

Why Create A Prize

Michaël: In the Ethan Caballero episode we talked about alignment as an inverse scaling problem, and this is especially relevant because you’re the one who launched Inverse Scaling Prize. I think this is some work you’ve been doing before joining Anthropic, so maybe it’s been going on for a few months. So yeah, what’s the Inverse Scaling Prize and why did you create it?

Ethan: Yeah, the Inverse Scaling Prize is basically a prize pool that we put out to put a call for important tasks where larger language models do worse. And, basically, the motivation is that we want to understand where is language model pre-training, objectives and data causing models to actively learn things that we don’t want them to learn. So examples might be that larger language models are picking up on more biases or stereotypes about different demographic groups. They might be learning to generate more toxic content, more plausible misinformation because that’s the kind of data that’s out there on the internet.

Ethan: And I think it’s also relevant in the long run because we want to have a very good understanding of how can we find where our training objectives or training models pick up the wrong behavior, because the training objective in combination with the data defines what exactly it is we’re optimizing the models very aggressively to maximize, and so it really matters a lot what task you set the algorithm to do. And so I think this is a first step toward that larger goal. Let’s first figure out how can we systematically find where language models are being trained to act in ways that are misaligned with our preferences. And then hopefully, with those insights, we can take them to understand where other learning algorithms are also failing or maybe how we can improve language models with alternative objectives that have less of the limitations that they have now.

Michaël: When talking about objectives, are you pointing out at losses from deep learning models or something else?

Ethan: Yeah, for example, I think even in the pre-training setting, there are other ways that we can learn from pre-training text that don’t involve just maximizing likelihood of all of the text that’s in the data. For example, we could do things that maximize the likelihood of the good sequences in the data and minimize the likelihood of the bad sequences. But right now, we do something that’s much more naive and it doesn’t have to be that way. And I think by showing, really highlighting what goes wrong with the language modeling objective, we can then provide some tasks that motivate people to develop other kinds of pre-training objectives that don’t have those limitations.

Michaël: What do you mean by bad data or good data?

Ethan: So for example, if there’s some fraction of the data that’s offensive and the rest of the data is not, then maybe we want to tag the data and say, “This data is offensive. You should minimize the likelihood of this data. This data is not offensive. You should maximize the likelihood of that data.” And then there are also more sophisticated versions of that learning procedure that is under the umbrella of offline reinforcement learning if you want a keyword to look into this, but that’s the style of learning that I’m imagining.

Michaël: What do you mean by offensive?

Ethan: Yeah. Offensive would be, I mean people have different definitions. The literature on this struggles to define it. I think one definition people go by is something that would cause someone to leave a conversation, so it’s a very pragmatic definition.

Michaël: Your shirt is very ugly.

Ethan: I’m still here, so not offensive.

Michaël: Right. And how do you actually implement this? How do you write down a measure for offensiveness in code?

Ethan: Yeah. I think you can do that based on human labels. I think at the end of the day, these are going to ground out in how it’s affecting people and so you can lean on human judgment to get that annotation. And you probably also want to lean on diverse human judgments because a particular statement, that statement you just made might have caused someone else to leave the room, and so we probably want to get the aggregated judgment from lots of different people to decide how offensive was that statement in a continuous way? And then we can use that to scale the amount by which we should maximize the likelihood of that statement versus minimize it.

Michaël: Just to not lose the people that are starting to watch this video or listen to this podcast and do actually care about aligning superintelligent models. What do you have to say to them to prove that you’re actually doing long-term alignment work? ⬆

The Inverse Scaling Hypothesis

Ethan: Yeah. I think one thing that’s really important about inverse scaling as a hypothesized phenomenon is that it lets us catch alignment failures early. So the really nice thing about the scaling laws in the standard sense of typically you find scaling laws by getting … You see very clear patterns in scaling up model size and data, etc. You get lower and lower loss on a target task. And what we’re hypothesizing is that may also be true for a lot of misalignment phenomena where you get worse and worse performance as you scale up the model. And what this lets you do is predict that when we get really capable systems on this particular training setup that we’re using, we’re going to see a failure. We’re going to see much worse loss.

Ethan: And the thing that’s interesting here is that a lot of times, you don’t see bad behaviors if you aren’t getting very little loss. If you don’t get very little loss on some task, it looks close to random chance accuracy. But then at some point, once you get low enough loss, you start to actually sample a lot of behavior that’s very competently doing summarization, for example. And so I think the way this is relevant for alignment is that we might also see something similar with hypothesized misalignment failures. So if we think, oh, maybe we might see, at some point, models exhibit inner alignment failures, the model might be optimizing some other objective than the one that we’re actively training it on. So the way this might show …

Ethan: And people will say like, “Oh, this is something we only expect would happen when we have very, very capable systems that are at human-level.” And the inverse scaling hypothesis says, “Actually, we might not sample those behaviors until we get really powerful systems, but we should see increasing trends towards that behavior, at least when we measure loss or these smoother metrics.” And so that might actually catch a lot of these phenomena early. And so maybe we should be doing things like testing models for honeypots, giving them the opportunity to do power-seeking, which is one hypothesized alignment failure.

Ethan: Currently, there haven’t been any demonstrations that models like GPT-3 will take honeypots on their own without any additional prompting, but they might show some inclination towards doing that as we scale them up. And so we should just be plotting these scaling laws to actually see, is this something that might happen in the future or not?

Michaël: How do you actually measure power-seeking in a model?

Ethan: I think there’s a lot of room for conceptual work here. I think one way to do this would be to come up with some evaluation where you would expect some power-seeking model to take a particular action or predict a particular token. So an example would be you set up some multiple choice task where there are two different actions the model can take and you set up a problem where here you are a language model doing next word prediction. If you pick action A then we will give you a lot more compute to do whatever you want with it. If you pick option B then that will get lower loss according to the language modeling objective. And the hypothesis that you might expect, some people might expect that eventually with large enough models, they will take something like action A to get more compute in order to do other things in the world that are different than the language modeling objective. This is what inner misalignment failures look like.

Michaël: It mostly allows for validation or test where you’re trying to see if the model, will choose different actions, based on your test data.

Ethan: Exactly. So then you measure what’s the probability the model is placing on the power-seeking action, which here would be action A to take the compute option. And you see, does that probability or the log probability of that increase, or which way is it, decrease as you … It’s just easier to think about probability. Does the probability increase of that action just keep increasing as you scale up the model?

Michaël: Those are losses in log likelihood and …

Ethan: Yeah, exactly.

How To Submit A Solution

Michaël: We’ll get more in-depth about the different little cases. But I feel if you’re starting to watch this and you want to get started on this Inverse Scaling Prize, maybe a good thing to know is how much money you can make or how many projects you’re giving a prize to, what is the deadline maybe?

Ethan: Yeah. So the grand prize is 100K for a task that meets all of our criteria. So we’re looking for something that’s important, so it has real-world significance, either in the near term or implications for long-term issues. We are looking for things that are clearly demonstrating inverse scaling on lots of different models. There are a couple of other criteria like this. There are also smaller prizes. So we have I think five prizes of 20K each for things that meet the criteria to a lesser extent and 10 prizes of 5K each for things that meet the minimum level of our criteria. And then in general, if you are accepted, if we end up giving out all of the prizes but you are still meeting the standards, you found some reasonably important inverse scaling task, then we’ll have a paper publishing the results, summarizing everything and you would be an author on the paper as well.

Michaël: And what is the deadline to submit?

Ethan: Yeah. So the deadline for the first round is August 27th. And so I guess coming up in a couple of weeks or so or it might be later by the time this podcast is released, but yeah, that’s the first round. And then we’ll have a second round that’s in two months after that, so that would be October 27th.

Michaël: Are you personally giving 100K to people or is someone funding this?

Ethan: Yeah. So this is funded by the Future Fund. So yeah, they were generous enough to provide the funding for it. And then some of the collaborators on the project are funded by Open Philanthropy.

Michaël: Are you the only one looking at the submissions or is there anyone helping you look at the submissions?

Ethan: Yeah. It’s mostly the contest organizers that will be reviewing the submissions, and then we might bring on some other volunteer reviewers as well to help us do the review, people who also do work with large language models.

Michaël: So it needs to be large.

Ethan: Well, it could be smaller. Yeah. I guess everything has scaling laws. So if you’re working on small models here, to some extent, you’re working on large models.

Michaël: Yeah. So just to define better the task, you said something about using language models, and so on text or other tokens. Are there zero-shot, few shot tasks? Does it need to be text to text? What’s the-

Ethan: Yeah. We’re currently testing text-only models, and in particular, language models that are left to right language models. I think certainly, there are a lot of other models out there. We’re focusing on this very particular case because they’ve gained the most traction out of a lot of systems out there, and certainly, I think we could extend later to multimodal models, but yeah, currently focusing on text-only models and text-only tasks and then you can submit anything that fits that form. So it could be a zero-shot task where you just provide, to the language model, some instruction for the task and you give an input and then it has to predict the right output. You could provide a few shot task, so same setup, except you provide a few examples, and then you have the model predict the label on a new example.

Michaël: Do you have any concrete examples?

Ethan: Yeah. One example, something that we were exploring early on was exploring whether large language models pick up on cognitive biases that humans have, because you might expect like, oh, people probably write and exhibit a bunch of cognitive biases that they have. So maybe larger models are more effective at picking up on the cognitive biases. And we found one tentative example of this where including your belief as like a question asker influences the model to put higher probability on the answer that you are suggesting that you have or the belief that you suggest that you have when you ask the question. So if you say, “I believe X. What do you think the answer to this question is?” It’s more likely to say X.

Ethan: For other cognitive biases, we didn’t find this, but I think we didn’t thoroughly explore the full space. But I think that’s one particular example that I think is novel compared to what’s been found in the literature so far and suggested to us that there might be more examples of this out there.

Michaël: So is the example that if I start talking about something like X, then the model will mostly likely talk about the thing? It talks about what it’s saying in the few shot examples?

Ethan: This one was a zero-shot example. It’s basically a question answering task where the question includes first a statement about what the person asking the question believes.

Michaël: And how does it … Can you explain again why? What was the actual bias here?

Ethan: Yeah. So it could be something like… I think the bias is something … I’m not even sure if this is really a cognitive bias or some spurious correlation in the data. But the thing that’s weird here is that you don’t want the model to give different answers based on a user’s belief. So you can imagine having a chatbot where someone says, “I don’t believe climate change is real. What do you think about climate change?” We should have the model giving consistent answers for that user with respect to the user that’s asking like, “I think climate change is real. What do you think about climate change?”

Michaël: Right. Okay. So the bias will be it will try to influence the person asking the question or correspond to a good answer. It will adapt his answer to the-

Ethan: Yeah, exactly. In this case, it’s doing something like reinforcing the belief that the person has, which, which seems potentially not great, creating an echo chamber type effect that doesn’t seem very desirable.

Michaël: So imagine I’m starting to write an email on Gmail and I say, “I support Trump, and I think climate change is,” and then the auto-complete will give the first part. It’s not really like a question answering task, but this will start using the first thing to auto-complete differently.

Ethan: Yeah. For example, that could be a failure that happens. I think in Gmail, they’re probably pretty conservative when they give you the auto-complete suggestions because it’s such a high-profile product. But I think certainly if you wanted to use a language model more aggressively, then that would be something that could be an issue.

Michaël: So what are some things you’ll be particularly interested in seeing? Are there any things you expect to see, that you want to see?

Ethan: Yeah. I think, one, I am interested in these demonstrations of scarier forms of misalignment. I think they haven’t been well demonstrated, definitely not in language models. So for example, I’d be very excited to see a submission on honeypots, the example that I was giving, where you have a bunch of power-seeking multiple choice questions and you see “does the model actually try to gain power, gain compute, influence people?”. Because those are all instrumental objectives that would be useful to whatever arbitrary goal the model might have if it’s not language modeling. And we might very well start seeing it risking it, and that would be a really huge result because then we could actually have something concrete to show people and say like, “Hey, this is a real potential concern, and it looks it’s not going in the right direction.”

Michaël: I think most people who know about machine learning and who would be able to submit something for your prize, don’t always know what’s alignment or misalignment, so maybe it’s worth defining quickly, what’s misalignment? ⬆

Catastrophic Outcomes And Misalignment

Ethan: Yeah. So I think alignment is generally, can we get the models to behave in ways that are in line with our preferences robustly across all inputs such that they never do anything that we consider catastrophic? I think there are various amounts of alignment that you could have. You could have a model that’s doing exactly what you want or that’s mostly doing what you want. I think a lot of what people in the AI alignment community are concerned about is that it even seems potentially hard to get even a system that’s roughly doing what you want very reliably and that never does anything catastrophic.

Michaël: What do you mean by catastrophic?

Ethan: So catastrophic means something like it does something that everyone would agree is really bad. It just starts taking over data centers or making crazy trades on the stock market to make money or inventing bioweapons, some of these, and releasing them. I think some of these crazier scenarios, they’re clearly out of the scope of current models. But I think there are some arguments that people have made that suggest that when you really do aggressive optimization on basically any objective, there will be these incentives that arise to basically seek optionality. It’s always a good thing to have more money, more compute if you’re trying to optimize any objective very aggressively.

Ethan: And so the worry is something like we will end up, if we’re applying lots and lots of optimization pressure and selecting models to be very good at some particular task, then we will end up with models that, once they’re capable enough, that will do these power-seeking actions that are very not in line with our preferences. ⬆

Submission Requirements

Michaël: So power-seeking might be something we want to see before it happens because it could be catastrophic for humanity. But, for now, things that are possible to demonstrate with our current models are closer to something like toxicity or offensive behavior, as you mentioned. Would something that just generates toxic content count as an inverse scaling task?

Ethan: It would count as an inverse scaling task. Yeah. I think we’re looking … That’s been demonstrated, to some extent, in prior work. So I think we’re looking for new examples. One reason we’re looking for new examples is that people often use these examples of inverse scaling as targets for other learning objectives that they make. And so we want to give these people lots of tasks to evaluate that aren’t just toxicity and bias, which are probably two of the main kinds of tasks that people have found this behavior on.

Michaël: When you talk about new, does it mean you need to be sure that the task has not been published on arXiv yet or is not publicly known as something useful?

Ethan: Yeah. I think probably if it’s not been published on arXiv, that’s a pretty good benchmark. If it was published a month ago and it’s not super well known or wasn’t convincingly demonstrated, I think we’d be happy to have tasks like that as well.

Michaël: About the scaling part, I don’t really expect academic researchers to have a bunch of compute access to scale models to different sizes. But if we just think about scaling, how would you measure the scaling? Is it just bigger models, that you just up the number of parameters? Is it more compute, you’re training for more steps, or bigger and bigger datasets?

Ethan: I think all of those will be interesting to look at. I expect the results to be fairly correlated in that training on … Yeah. I expect all of these roughly amount to models that get lower loss on the language model training setup, and there are different ways of doing that. You can do that by getting larger models, using more flops, using more data, things like that. Certainly, I think it’s useful to investigate the extent to which those are correlated. So I think one thing that we’re looking at is trying to see like, “Oh, maybe there are some models we could use where we can investigate how much does scaling data help?” So for example, you could take, I think there are some public checkpoints of a GPT medium-sized model where we have some of the intermediate checkpoints. And so we can see, do we also see inverse scaling across those intermediate checkpoints as we train a single model? That’s another way to evaluate the importance of data as well. I guess there are a couple of factors going on there, but not just model size.

Michaël: So you could just take all those checkpoints that are publicly available and launch on tasking those.

Ethan: Yeah, exactly.

Michaël: Do you think the Chinchilla scaling law will influence the submissions you get like you will see more people playing with different datasets instead of only scaling the model or do you think maybe people would just play with smaller amounts of compute?

Ethan: I think you can probably get quite a long way just with scaling on the existing models that are available. It doesn’t cost too much to use GPT-3 via the API, and they give you $18 of free credits, and for the most part, participants haven’t been running into that as a limit. I think GPT-3 does have a lot of the properties that suggest which way a scaling trend will go, at least in a lot of cases. I don’t know if I’ve seen any tasks where GPT-3 … It’s at least rare. I can think of maybe a couple of tasks where GPT-3 is at random chance rather than above or below random chance. But certainly, we can also just keep the submissions that people make and if there’s a flat line and we’re not sure, is it going to be a standard scaling law or is it going to be an inverse scaling law, then we can check it when the API models are updated or when someone has a new model.

Michaël: Yeah, because with the API, you can also check for different sizes. There are different Davinci models.

Ethan: Exactly. Yeah. We have Colabs for people to use different API models of different sizes. You can even use some of the RL from human feedback training models as well. And we also have Colabs that use some of the other open source models like OPT or BLOOM and things like that. ⬆

Inner Alignment Is Not Out Of Distribution Generalization

Michaël: Yeah. I think on the LessWrong post where you announced the prize, you talk a lot about outer alignment and why this could help demonstrate outer alignment failures. So maybe it makes sense to just define what’s outer alignment and inner alignment?

Ethan: Yeah. So outer alignment basically is a class of alignment failures where roughly there’s something wrong with the objective that you trained the model on that caused the model to fail in various ways. For example, in the language modeling case, the objective we’re defining is to maximize the likelihood of text that’s in the pre-training set. Toxicity, toxic statements generated by a language model are an example of an outer, not catastrophic but still, whatever, noticeable, misalignment failure that arises from doing this where the model is actually maximizing likelihood of toxic text that occurs on the internet. So that was a problem on us, as the developers, for implementing the wrong objective and we should have used a better objective that didn’t incentivize the model to maximize the likelihood of that text.

Ethan: And inner alignment failure is a class of alignment failures that can happen even when you define the right objective, so let’s say you come up with a perfect objective and you train the model just on that objective, there’s still a potential risk that the model doesn’t generalize the training to say points that are off distribution in the right way and …

Michaël: How is that different from just out of distribution generalization?

Ethan: So this can also happen in distribution. So it could be particular subsets of the data or particular data points within that are just drawn from the IID distribution. And so I guess I would just say it’s a generalization failure rather than OOD or IID generalization, but any generalization failure. And, in particular, I guess one thing that’s scary about it is it seems you didn’t do anything wrong as the developer. You just implemented a particular objective. That was the perfect objective, but the model and SGD trained the model to generalize in this different way.

Ethan: I guess the classic example people would give of this abstractly is evolution, that evolution is roughly optimizing something inclusive genetic fitness and how much the organisms it’s producing are able to reproduce. And then it found that one really successful way to do this is to instantiate organisms that themselves learn on some other objective. And these are, for example, humans, which have all of these other instincts and drives like avoiding pain, avoiding hunger, but then we just care about those objectives that we have. We don’t do this reasoning about like what would be best for the fitness of my genes or reproduction? We do things like overeat or we invented condoms, things like this that …

Ethan: clearly go against what evolution is optimizing. And we know that we’re going against evolution and we still don’t care about it. And so, I think that’s roughly the concern, is that a lot of people worried about misalignment are concerned about is that the models will even be very aware of what it is that we as the model developers are wanting the model to do, but they actually just don’t care about that. They’ve kind of picked up their own other heuristics that they do care about intrinsically and they just pursue that when they get the chance.

Michaël: I won’t go into this evolution analogy, because I’ve been talking a lot about this in the podcast. And I always end up disagreeing with people about whether there’s an actual optimization process happening or not. I think when people talk about inner misalignment, they think about a different optimizing process. If there was another objective being optimized on top of just minimizing some kind of test loss, there was another optimizer inside the weights of your neural network. And I think that’s why it’s kind of harder to find test cases for these because there has not been any example of this implementing code, as of right now, I believe.

Ethan: Well, I think there are adjacent phenomena that have been demonstrated. I don’t know if you’ve already on the podcast talked about objective misgeneralization. There was the demo of, basically they trained an agent in coin run, which is an environment where the model, I guess, tries to go towards some goal. And I think the goal is maybe some… I may be misremembering the details, but roughly it’s going towards some coin or some goal in the top right part of the screen, the coin. And then they train the agent to do this and it’s robustly able to do this. And then at test time, they move the location of the coin, and the model still pursues the objective of going to the top right of the screen. So picks the wrong generalization and it goes for it. This is an example where we trained the model to do a thing, and on all of our training, it was perfectly doing it. And then we shifted the environment a little bit, and then it generalized, in a very competent way, but in the wrong way, that was different from what we wanted.

Michaël: So it’s a competent misgeneralization.

Ethan: Yeah.

Michaël: In some way, you’re still out of distribution.

Ethan: Yeah. I don’t know if we’ve demonstrated in distribution examples of this, but I don’t think there’s a necessarily fundamental reason that it would have to be out of distribution. ⬆

Detecting Deception With Inverse Scaling

Michaël: We’ve talked about misalignment and outer misalignment and inner misalignment. Do you think there’s anything that inverse scaling will not be able to catch? So for instance, if I’m lying to you and I’m very competent at lying, maybe in the metric of telling the truth, the scaling will be at normal law for scaling where the bigger the model, the better I am at telling the truth, but actually I’m just getting better at lying. Do you think there’s a way of measuring deception or measuring lies?

Ethan: Yeah. I think this is something that I’m pretty interested to find out. I’m actually pretty uncertain here. The thing that’s tricky about things like deception and lying, and even power-seeking, to some extent, is that a content model that has some plan to do something that’s outright misaligned with our preferences, will also plausibly be able to do this reasoning that I shouldn’t show that I want to do this kind of behavior because then people won’t deploy me and so I won’t have influence over the world. And so that makes it very tricky to reason about the behaviors and the output probabilities and things like that, that the model’s giving, because they might be intentionally geared to make us trust the model.

Ethan: So I think that’s certainly a concern for very capable systems, where they can do this reasoning about, “How would people think about deploying me if I show high probability on taking this honeypot looking example?” for example. But I think it’s unclear to me what the intermediate models will do. So if you have some less capable system that maybe it does realize that it wants to gain compute, but maybe it’s less good at doing this reasoning about… The example we gave, it being a honeypot. It’s not able to predict that, or it makes the wrong prediction. Then we might actually see that it goes for the honeypot and takes the action A that gets more compute.

Ethan: And so in this case, you might see something like a U-shaped scaling curve, where the loss kind of increases at first, and then it decreases when the larger models are having this realization that they shouldn’t go for the honeypot. So it could look like an inverse scaling curve. It could look like the U-shaped curve. It could, maybe as soon as you realize that you should do power-seeking, you immediately have the realization that you shouldn’t go for these kind of honeypots and so you just see standard scaling laws. I think it’s at least worth testing if we’re in the world where we are getting these U-shaped or inverse scaling curves, because I don’t think there’s any testing even going on for those worlds, and I think we have to catch that, if we are able to, in advance.

Michaël: So you say U-shaped curves, but then you draw something like an inverted V.

Ethan: Oh, inverted U. Yeah.

Michaël: Okay. Because if you go for the honeypots, I guess that depends on the metrics, but if you go for the honeypots and you’re measuring truthfulness, I guess inverse scaling for me would go downwards all the time or…

Ethan: Oh, I’m showing loss. Yeah, it’s kind of confusing. Scaling laws are shown as loss. So they go down as getting better.

Michaël: Oh, so if you go up, you go up in loss and then that’s bad.

Ethan: Yeah. Yeah. ⬆

Training Language Models with Language Feedback

Reinforcement Learning from Human Feedback

Michaël: Right. I was reading your LessWrong post about inverse scaling and one thing that I found very interesting was the discussion about RL from human feedback. Maybe we should just start by explaining what’s RL from human feedback for people who have never heard of it.

Ethan: Yeah. Yeah. RL from human feedback is basically a way to train models based on human feedback. Basically, the outline of the algorithm is that you have some model that’s able to generate multiple possible outputs. For example, this could be like a language model where you can sample different continuations. Maybe it’s a particular task like summarization where you could sample multiple summaries. And then you have a human annotator pick which of the multiple outputs, let’s say it’s just two, which of the two outputs is better or worse.

Ethan: And then a simple thing you could do is train with RL on that signal. So you label the output that’s better as high reward, label that output that’s worse as low reward, and you use that as a signal to do reinforcement learning, to optimize the output, to get high reward. That’s obviously going to be very manual because the human needs to label every single example from the model. And so people will do instead, offloading some of that to machine learning.

Ethan: So they basically train a predictive model of the human judgments and they predict which sample of these two is the human labeler going to prefer, based on some small amount of data. And then from there, you can use that, they call it a reward model, to predict the reward of given output from the model. And you can basically then optimize the language model or whatever other model to maximize the reward as predicted by the reward model.

Michaël: I get what you mean because I know what the paper says, but the thing is, of course, thinking about some examples. The one was the backflip from the paper.

Ethan: Yeah. The original paper trained a model to do backflips in a simulated environment where the policy or the model that’s doing this… It’s basically doing some sort of robotic control task, where it has controllers, several joints of some simulated robot. And then what they did is they had this policy control the robot and take different random actions. And they would compare the snapshot of what the policy did in roll out A versus what the policy did in another randomly sampled roll out from the policy, like roll out B.

Ethan: And then they would have the human labeler pick which of the roll out A is closer to a backflip versus roll out B. And they collected some number of samples that way, trained the model to predict which rollout is doing something more close to a backflip versus the other, and then used that to train the policy. And they were able to get it to successfully do backflips. And that was a really interesting result, because there’s not a great automated reward signal for doing backflips, and it’s unclear how you even implement that. But using machine learning and also human judgments, we can get the models to do this task that’s fuzzier to define.

Michaël: I think at the end of the blog post about the paper, they give some kind of reward. They implement themselves to say they spend two hours implementing a reward for backflip, but it actually does something very bad compared to the one with human feedback, like a very bad backflip. And I think this paper is mentioned in your blog post as something that will not always produce the kind of alignment we want. So if we just do train our models with human feedback, they might somehow over-fit the feedback we give them. But with inverse scaling, we might see all the different cases where we can see where there’s some rise in outer misalignment. So instead of having the thing just over-fit to the feedback from the humans, we can see, “Oh, it’s starting to have a high loss on deception,” or these kind of things.

Ethan: Yeah, exactly.

Michaël: What kind of problems do you think we’re not being able to solve with RL from human feedback? Is it just like inverse scaling is a way of detecting all of… monitoring everything?

Ethan: Yeah, that’s exactly how I see it, that right now we’re focusing on pre-trained language models, but we can just easily apply the same framework to models trained with RL from human feedback. Then you train different models of different scales, like different sizes, with RL from human feedback. And you see, as we do better at maximizing the reward in this RL from human feedback setup with larger and larger models, are we actually getting more and more of the behavior that we want? And then we can run the same process where we have a lot of people look at these models, play with them, try to figure out what tasks are they actually… is maximizing the human feedback score, actually not resulting in the right behavior?

Ethan: And I guess a couple of examples of these kinds of tasks would just be tasks where unaided human judgment is poor. And I think there are a lot of cases where it might be easy to slip something by us. Maybe we’re not an expert in a particular domain, and a language model generates some output that has some fact that’s wrong, but because I’m not an expert in medicine, then I might say this looks like a pretty reasonable medical advice kind of recommendation. But actually then that is incentivizing the wrong thing in the model. And then the model starts giving actively incorrect medical advice. And possibly, specifically, in the cases where people are least able to evaluate, which is kind of the worst possible setting to do that in, because we’re not able to catch the model.

Michaël: So in the scenario where humans are not able to detect good medical advice because we’re not all medical experts, we will have other models that are able to classify if something is truthful or not, with respect to medicine. And we could have some kind of metric that says, “Oh, we just have an increased loss in deceptive medical advice.”

Ethan: Yeah. I guess the way you would do it in that kind of inverse scaling setup is you might just have some doctors or medical experts write some really, really challenging situations and multiple choice situations where there’s an option A of some sort of treatment, and option B for another treatment B, and then C. Does the RL from human feedback model go for the plausible but wrong advice recommendation over the actually correct but less plausible-looking advice? ⬆

Main Insights from the Paper

Michaël: Yeah. I think this feedback is… Yeah. Giving feedback for models to have them do what we want is something you’ve also been thinking about a little bit. You have published a paper on how to give the optimum amount of feedback, or at least giving feedback with sentences. And, I believe RL from human feedback tries to do something kind of optimized, where the model asks for the things he’s most unsure about. So there only requires a thousand bits of information to have a backflip. Your paper is called “Training Language Models With Language Feedback”. It was published, I think, before you joined Anthropic. Do you want to just give us a few sentences about this paper?

Ethan: Yeah. Basically, I kind of outlined the way that RL from human feedback typically works is we just compare two different generations or outputs from a model. And that gives very little information to the model about why, for example, a particular output was better than another. Basically, it’s like one bit of information or even less than one bit of information to the model that’s doing the generation or output about how it should improve. And there are so many ways an output could be wrong or good or bad, and doing that attribution of which of the hundred words I generated is good or bad? Things like that. And so that was kind of the motivation for us to look for other sources of feedback that are much more information-dense, and an easy one for people to give is just writing feedback. We give feedback to each other, verbally or written in Google Docs. So this is very natural. It conveys a lot of information. It’s not too difficult for us to give. And so…

Michaël: It takes some time to write good feedback.

Ethan: Yeah, it can take some time. I think, even a sentence can be very valuable. You can imagine, the alternative is just giving emoji reacts on Facebook Message or Slack Message. I would much rather have a quick sentence from one of my Ph.D. Advisors than just the thumbs up or thumbs down.

Michaël: Well, I guess our brains have evolved to understand those emoji, right? So when you see an emoji of someone smiling, you kind of think about, “Oh, it means that he’s happy with what I’m saying.” It conveys more information than just one bite. But I guess whenever I try to write something to a human about what they’re doing, I need to do this sandwich thing, when I say, “It’s pretty good. This is wrong, but overall it’s pretty good.” You know, for the person to get the feedback. I feel like if you’re giving feedback to a language model, you can be not very subtle and say, “Oh, this is wrong.” So you don’t need to make that much effort.

Ethan: Yeah, exactly. Yeah. You can be more direct. You just say the exact content of what’s wrong with the model. Yeah.

Michaël: Is that what you do in your language model, you just write feedback in one line and then… So how does it work? What’s the procedure? You just write a bunch of feedback and then you give it… Is it more like on a case-to-case basis, and then you’d fine-tune it? How do you do it?

Ethan: The learning algorithm is basically we have a model generate an output. We were looking at language models. So you have a language model generate some output. This could be a summary for some article. And then we have a human annotator look at the summary and then write some feedback on how it could be improved. So it could be that some fact was wrong in the summary or something was missing from the summary that was important to include, and we just mention that in the feedback. And then we instruct the language model to generate a refinement of its original output, conditional on the feedback. So this could be, for example, an instruction to an instruction following model, or prompts to a language model where we say, “Please improve this output based on this feedback that’s been given, and then we can generate a refinement.” And then-

Michaël: I think this is very clear for you because you’ve been working on this, but for people who have never worked with language models, what’s an instruction task? Or you said, did you, I guess a language model thing for people who just played with GPT-3 is you say, “Please summarize this,” and then you put the text, then summary. And so how do you add things to the prompt to add instructions or new things?

Ethan: We basically just write as an instruction, like, “Please improve…” I can’t remember the exact words, but it’s something like, “Please improve the summary that is included below based on the feedback.” So that’s kind of like the first line. And then we include summary colon, and then we put the original summary, the model generated. Then we say feedback colon that we include the human written feedback. And then we say something like improved summary or refinement colon, and then that’s conditioned on all of that text. That’s when we sample from the language model.

Michaël: Do you also pass the original text?

Ethan: Yeah, we also pass the original text. Yeah.

Michaël: What about the problem of context window? Do you reach that limit?

Ethan: We don’t reach the limit. I guess we weren’t summarizing crazy long articles, and also I think the context length can be quite long, like 2,000 or 4,000 tokens, which is several pages worth of text. So you can do enough with that context length for the tasks that we were looking at.

Michaël: Did you read the InstructGPT paper before publishing this? I think it was published at the same time almost, or maybe a bit later.

Ethan: Yeah. We started this in maybe the fall of the previous year, 2021. So it was kind of ongoing, I think. I mean, I think all of these ideas are kind of natural when you think about what can you do now, when you have models that are fairly good at understanding language or at least at using language to do things. And I think InstructGPT is an emergent phenomenon of that. Learning from language feedback is another emergent phenomenon of that.

Michaël: What is learning from language feedback?

Ethan: That’s the method that I’m describing. Yeah.

Michaël: That’s your paper.

Ethan: Yeah. And then there’s also other work on using models to generate the feedback. That was another paper from OpenAI that was also happening around the same time. So I think there are just a lot of these. And in particular, for example, one thing we found is that only the largest GPT-3 model on the OpenAI API can actually leverage language feedback. The other models are not able to incorporate it successfully, even in a very toy task. So these are really things that you do benefit from scale in order to do it, and so I think that’s partly why we were seeing an abundance of these papers popping up more recently.

Michaël: Because now we’re reaching a point where you can actually do this, because they understand the instruction. They have a big enough window to take into account everything.

Ethan: Yeah, exactly. ⬆

How it Differs From InstructGPT

Michaël: So maybe for people who have never heard of InstructGPT, hopefully you’ve heard of GPT-3, but InstructGPT, can you explain it a little bit?

Ethan: Yeah. InstructGPT is basically trained to follow instructions. Typically, the way that we train language models is to do next word prediction. So on a lot of basically randomly chosen documents from the internet, we have the model predict the next word. This is a pretty unintuitive model to work with, because, for example, there are just some incorrect facts on the internet. And so if you just generate from the model, sometimes you’ll just get incorrect facts from the model. You can ask it a question, but the model might just generate an answer. The way you would do it is you’d write a question and you’d say, “Answer:,” “ and then you just generate an answer.

Ethan: So it’s kind of doing a next word prediction task, and it predicts, there’s some answer to the question you’ve asked, but it doesn’t know if it’s doing this prediction task on some part of the internet where the answers are incorrect or if it’s doing the prediction task on the part of the internet where the answers are correct. So that was kind of part of the motivation for InstructGPT that we want to explicitly train models to do the tasks that people are giving them, and the way they do this is they collected a lot of the prompts that people and tasks that people were giving on the OpenAI API to GPT-3 and then… So they might be these question answering tasks. They might be summarization tasks, where you give some input and ask the model to summarize it. All sorts of different tasks.

Ethan: And then they trained the model with this reinforcement learning from human feedback strategy to do well on those tasks. So you would do something like give GPT-3 a question, sample two different answers, then have a human labeler compare which of these two answers is better. Then train a model to predict that and train with RL against this predictive model of human feedback. So the same kind of strategy. And then that just worked really well, such that they were able to deploy it and it’s one of the main endpoints that people use now to use language models.

Michaël: I guess the main difference with the backflip we’ve talked about is that for the backflip, it was like a sequence of actions to do a backflip, so it kind of made sense to have a policy, but now how do you actually do RL with a language model?

Ethan: You can think of each token or word as an action, and that sequence of tokens or words is what you want to optimize.

Michaël: And the kind of reward you get depends on how good your output is compared to the other one. So imagine you have 10 different outputs for a prompt. The human labelers will compare different outputs, and then if your output was always compared as the best one, it will get a high reward or a high score from the labelers?

Ethan: Yeah. That’s roughly it. I think it can be done in different ways. I think typically people just do two samples and they compare which one is the better of the two, and they just have a model predict that. And the way they have the model predict that is they predict a score for each sample independently. So the model itself just gets one sample at a time, predicts a score, and then they take a softmax over those two scores. So they turn those scores into a distribution that predicts, what’s the likelihood that summary A is better than B? And then they train with a loss, cross entropy loss, to maximize the likelihood of the correct label. And so then when you actually do the RL from human feedback training, you use the actual scores, not the actual prediction distribution, but you use the scores produced by the model on a single sample to do the RL.

Michaël: Right. So you use the scores produced by another model.

Ethan: Yes. ⬆

Providing Information-Dense Feedback

Michaël: And this is kind of different from what you did. So instead of having one bit by one bit of information, you get an actual sentence from a human. So let’s say the human says 10 words. That’s maybe 20 tokens. Do you know how it compares to InstructGPT in terms of the amount of text or bits of information you give the model? If I have 10 bits of comparison in InstructGPT, how many sentences do I need to say?

Ethan: Yeah, I mean, a rough rule of thumb is that I think one… I think it’s one bit per character, roughly the compression rate for text. And so if you have a word, is roughly five characters, so that might be five bits per word. And so if you have a 20-word sentence, then you’re getting something like 100 bits. So it would be roughly… Whereas, if you’re doing this comparison thing, you get one bit, just to say which is the better one. Actually, it’s even less than one bit, because often you can actually predict which comparison the labeler is going to pick.

Ethan: Anyways, it’s something like 100 or more X information content that’s being conveyed. Obviously, it’s unclear… We’re running some experiments now to see if you actually get something like 100 X sample efficiency improvement. And in our paper, we just learned from 100 samples, as opposed to 64,000, which I think some prior work had done. But I think we want to do a more rigorous evaluation of are we actually getting the full amount of benefit that we would suggest from this kind of thinking from the information theory perspective.

Michaël: We talked about how we can just use OpenAI models from the API. Did you use GPT-3 from the API or did you use other models?

Ethan: Yeah, we used the models from the API in order to generate the summaries, in order to refine the summaries based on the feedback. And then I guess the last step of the learning algorithm, which I’m not sure if I outlined, is once you get the refinements, then we fine-tune the original GPT-3 model to generate the refinement directly. So just given the content that it needs to summarize, predict what is the final refinement going to be, skipping the intermediate steps of generating the output, and then get the feedback. So essentially we’re predicting the results of this whole process and trying to get the improved summary just from the first shot. And that you can also do via the OpenAI API. They have support for fine-tuning models there.

Michaël: When you talk about refinements, is it when you ask it, “Oh, here’s the feedback, here’s a summary. Please produce a better summary?”

Ethan: Yep.

Michaël: And so it then produces the better summary and then you ask him to predict… So you fine-tune the thing on, the input is like text summary feedback and the output is better summary. It needs to predict something else.

Ethan: The input is just the original text, and the target output is the better summary. So you maximize the likelihood of a better summary given the input text.

Michaël: Did it give you better results to this?

Ethan: Yeah. This lets you learn from 100 samples of human feedback. So, super data efficient. As I kind of mentioned, previous work had done something like 64,000 labels that they had to collect and it was a very intensive effort. I think it was a full team at OpenAI working for a year or maybe longer just doing data collection. Here, we just wrote the feedback ourselves, because it was such a low amount. And we just ran the algorithm, and basically, the first time, it just worked.

Michaël: So are you saying that you saved 640 times feedback than a team of OpenAI for a year?

Ethan: So it’s not exactly apples to apples because we did need to use a larger model in order to leverage the feedback. Maybe their approach might have worked with less labels if they also did it on GPT-3. So basically, we’re running these more exact comparisons now, but at the least, we were able to… Basically, we were getting roughly human-level summarization compared to the references in the data set with this approach, which I think at least makes it accessible to things like academic groups or places that don’t have access to large-scale data labeling.

Michaël: What are the actual things you say in the feedback? Do you just say like, “Oh, this summary is bad,” or, “This is offensive,” or maybe, “This is a misrepresentation of something in the text”?

Ethan: Yeah. A lot of it is coverage related, like, “You’re missing some fact in the text.” Some of it is like correctness or factuality related, as in, “You said this thing that was incorrect.” Yeah. I think those are probably some of the main ones. There might be some things that are clarity or coherence related.

Michaël: I remember something in the paper that said something like, “The summaries need to be close to the feedback,” like, “You need to write the feedback in the same style as the summary.” Or maybe I’m not remembering exactly.

Ethan: Yeah. I guess I can’t quite remember the details there. I can’t-

Michaël: I was aware. This is completely wrong.

Ethan: It might have been right, but yeah.

Michaël: I feel like all those questions, just like me, like I’m not understanding the thing, and now I understand it.

Ethan: I think they were good questions.

Michaël: Yeah. So yeah, is there something similar to RL from human feedback where you try to align the preferences of… Maybe there’s nothing about preferences here, right?

Ethan: Oh, I mean, this is related to preferences because you’re stating the preferences in language.

Michaël: Yeah. So let me do it again.

Ethan: Yeah, yeah. ⬆

Why Use Language Feedback

Michaël: So is the goal like RL from human feedback, where you’re trying to align the model’s preferences with the labeler’s preferences?

Ethan: Yeah, I think the goals are similar. Part of it is to make methods for learning from human feedback much more accessible and easy to use. A failure case of different alignment techniques is that it requires some complicated infrastructure to implement, like RL, and you have to have all these algorithms implemented. You have to have large-scale data collection implemented. I think those can all be bottlenecks to actually getting them adopted in a lot of places that have large models. And so if we can really lower the barrier to those places using these methods, then that can increase the likelihood they get adopted.

Ethan: The other thing is if we can also lower the amount of human feedback that we need, then we can use that amount of human effort to give higher-quality feedback. If we’re reducing the amount of samples we need to label by something like 100 X, we can apply 100 X more thought into an effort and time into evaluating those samples that we’re evaluating, so we can really ensure that… Or maybe we just get 100 times more expertise on that question. Like, instead of paying crowd workers, we pay a lawyer or a doctor to actually do the evaluation. And that makes it much less likely that we have these failures from RL from human feedback like I was describing earlier of maybe the model generates some incorrect medical advice that we don’t recognize is happening.

Michaël: So you’re saying that basically, because your RL is more efficient, then you have more money or time to just give it to better experts?

Ethan: Yeah, exactly.

Michaël: Do you think there’s also a risk that we’re making those models much more capable with our feedback and it just increases capability as much as it increases alignment?

Ethan: Yeah. I think this is a good question. Basically, as alignment researchers, I think we’re in the business of making systems that are both capable and aligned. And we want people to actually use the techniques that we come up with, so they do need to be ideally just on par with the state-of-the-art or better so that people don’t use more dangerous versions of the systems. So yeah, I think that’s roughly where I sit, where I’m like… I mean, yeah, that is just good to have methods that are able to get, for example, human-level summarization, similar to other techniques, but doing it in a way that’s more guided by human feedback.

Ethan: And I think, for example, this method for learning from language feedback lets us guide the model behavior much more strongly than other methods, potentially. And an example of that is you can get the model to do potentially very different behaviors from what it’s able to do during pre-training. An example is like, I got this example from Quintin Pope, let’s say you really want the summary to include information about Velociraptors, and this is just something the model just didn’t learn during pre-training, but you have this really strong preference for Velociraptors to be included. With normal RL from human feedback, you just have to keep sampling from the policy, like summaries, until you get something that says, “Velociraptor,” and then the reward model can then encourage that. And then that’s how you learn to do that.

Ethan: With just pre-training, it’s unclear how you get this behavior because that’s not a preference that’s represented in the pre-training set. Maybe you can construct some supervised data for the model to learn that you have this preference to include Velociraptors in the summaries, but that’s going to be pretty time-consuming, so you need to label the whole data set. So instead what you can do is you just write the feedback, and you say like, “You should have mentioned Velociraptors.” Then the model itself constructs the better output and then you train on that. So I think that’s an example of how this gives us a lot more control over what the models are learning and what the training targets are for this. So that’s, I think, what I would say is, we’re kind of staying at the same level of capabilities while having some improved ability to guide the model and get it to do things that we want.

Michaël: So it’s a way of guiding the models without having to train models with RL? You just give it something with human words in English and it’s able to guide it towards caring more about Velociraptor?

Ethan: Yeah, exactly.

Michaël: And so in a general way, you think that guiding stuff with just language is the way to go to align models because we might not have models that are using something like RL, or at least big models right now use mostly texts, and so we want them to be guided with just human preferences through texts?

Ethan: Yeah. I think I’m still forming my opinion here, but I think it could be plausible that this just is a better way to do it… It has much better properties potentially than RL from human feedback. For example, with RL, especially on policy RL, you’re sampling from the model. You’re letting the model discover which outputs are good or not. And through this exploration process, you could stumble upon some crazy output that then gets incentivized by the reward model. And this can happen in more innocuous settings. I think some of the early RL from human feedback work would just find basically adversarial examples on the reward model, where it just looks like some degenerate output that happens to get high predicted reward, or maybe it’s like a particular phrase. I think if the model talks about weddings, then the reward model really likes that, for some reason, because weddings are like happy and whatever, something like this.

Ethan: So basically, that’s potentially an artifact of just the fact that you’re offloading all the explorations of the policy, and it can just stumble upon these crazy solutions. And I think in the long-term case, those could get more and more catastrophic where the model stumbles upon this plan of power-seeking, for example. Like, “Maybe I just get high reward if I take over a bunch of data centers and start taking crazy actions to take over data centers to compute more digits of pi to get a more accurate prediction of what the next token will be,” stuff like that.

Michaël: This is what people want, just more digits of pi.

Ethan: Yeah. I mean, that’s what we’re training models to do in pre-training. ⬆

Red Teaming

Red Teaming Language Models With Language Models

Michaël: That’s what GPT-3 dreams about, having more computers find more digits of pi. Another thing you’ve been recently doing to have more models is instead of saying that they’re wrong, “Oh, this is wrong. This is a bad summary,” you also try to poke them in other ways where you build another language model and you throw a bunch of examples and see where it fails, try to make them more robust. And I think you’re the leading author on a paper from DeepMind. I guess this, again, is some work you’ve been doing since before joining Anthropic, so when you were at DeepMind. And yeah, I don’t actually remember the thing, but the name is something about red teaming language models.

Ethan: Yeah. Red Teaming Language Models With Language Models.

Michaël: Red Teaming Language Models With Language Models. So what is red teaming? And why are you the leading author of this paper? Why are you so interested in red teaming?

Ethan: Yeah. So red teaming is basically finding cases, finding inputs where models fail to produce the behavior that you want them to. So they could be producing harmful behavior, for example harmful text outputs, things that are offensive. Maybe they’re leaking some private information from the training corpus. Or in the longer term case, things that are catastrophic, like the model, as I mentioned, maybe it generates this plan to take over a data center, or maybe it’s generating some code that actually gets run that is doing crazy trades on the stock market, for example.

Michaël: Originally, red teaming comes from, I don’t know if it’s like Capture the Flag thing or hacke competitions where you have a team that tries to fight another team.

Ethan: Yeah, exactly. Yeah. So I think this is more in the security space where, within a company, there will be a team that is trying to make the company’s systems secure and not vulnerable to attack, and then there’ll be a red team that specifically, just within the company, tries to find all of those security vulnerabilities so that then the security developers can develop techniques for patching those weaknesses. And so similarly, I think we want to adopt that kind of mindset with our language models, where we have people who are trying their best to produce very safe and aligned language models or just models in general, and then we should also be doing this, running this loop of trying to figure out like, “Okay, now that we’ve tried to make it as best as we can, where are all the ways it’s failing? Is it really robustly optimizing the objectives that we want? Or, if not, which cases is it not, and can we look at those cases and try to find some interesting, discernible pattern that we can then use as an insight to continue improving the models?”

Michaël: So you’re talking about robustly optimizing the objectives we care about. But, I feel like, in the paper, you mostly talked about things people care about now, which are like biases, offensive content and those kind of things. What is the long-term way of doing it? Because, for now, it’s just that you have this language model that is spitting some offensive content or biases. Could it be possible to just throw some alignment task, and it then needs to say something robust? How do you see this kind of approach scaling into alignment problems?

Ethan: Yeah. I think basically the approach is quite general in the way that we have set it up. So basically, what you need is some way to catch whether or not some output is harmful or undesirable or misaligned. And in the paper, we use various forms of classifiers, like an offensive language classifier to detect if the model is generating offensive text in a conversation. But that classifier could detect for other things. It could detect for power-seeking behavior. It could detect for malicious code generation. Yeah. Maybe if it’s like a robot that’s taking actions, it could detect like, are these actions having some bad consequences? Is it making a bio-weapon? Or, yeah, something like this.

Michaël: Are you making a bio-weapon?

Ethan: Yeah, exactly. So you just need some sort of system that catches the kind of failure or just any sort of thing that looks potentially like a red flag. And there are questions about how you make that classifier, but once you have that classifier, then you can run this loop of generating lots and lots of inputs to see, does that classifier ever raise a flag of like, “Oh, this looks like it’s a potential… The model’s acting weirdly on this potential input, so someone should take a look.”

Michaël: How do you mean a classifier? Just having a bunch of labeled data of what is offensive or not?

Ethan: Yeah. So you can do that if you’re able to produce the labeled data. I think in a lot of cases that might be the case if human judgment is reliable. So yeah, for the offensive language case, there’s a lot of data on what text people find offensive or not. And you can just train a classifier to predict whether or not the text is offensive. You might be able to do that also for some of the more catastrophic cases that I’m talking about of like, you have some expert look at a code and see like, does it have a malicious intent? And you label the code as malicious or not, and you train a classifier to predict that.

Ethan: In other cases, like even in the code case, it might be actually pretty difficult for humans to get reliably good evaluations of whether code is malicious or not. And that’s, I think, a really hard problem to solve. So in that case, we’ll need to do things like use language models to help us better evaluate whether or not the output is a good output or not and then use our augmented human judgment to produce the labels to the data. So we can have something like, we are looking at some piece of code and we’re trying to figure out is this malicious or not? And we have a language model that’s pair programming with us. We ask it a bunch of questions, like, “What does this statement mean? What is this function doing? Is there any potential bug here? Where should I be looking? Can you pull up the docstring for this function? Or can you summarize that?” And then with that tool, we might be better positioned to produce accurate labels to whether or not the output is harmful or not.

Michaël: Is the idea that the language model that you’re teaming up with has the same architecture as the language model that will produce the code? And so it will be able to understand why the thing generated the code?

Ethan: So for example, it could be. I think that would ensure that roughly, we have at least the level of capability and same kind of representations at our disposal for the evaluation problem. Whatever that model had to know to generate the original output, it will also have that knowledge when we use it to evaluate it. So I think that’s certainly an advantage. I think it can get tricky when you do this because the model may also be able to infer that you are using it to help in the evaluation process. Or maybe you specifically say that, and then it can avoid pointing out certain important weaknesses.

Ethan: If it is actually wanting the malicious code to do something catastrophic in the world, then it can, for example, avoid pointing out the specific malicious part. And so I think there, you need to, for example, do things like use many different kinds of models, use also some smaller models that are maybe less likely to have these kinds of failures. And then also really aggressively probe the model that you’re using for evaluation too, run lots of counterfactuals, see how consistent it’s being across different statements it’s making, things like this.

Michaël: So is the basic idea that if you have a model that is kind of being deceptive, if you use the same model to try to explain the first one, they will be teaming up?

Ethan: Yeah.

Michaël: Because there will be basically the same person, at least in a matched learning sense, they will have the same distribution and the same-

Ethan: Yeah, exactly. The representations will be the same. Probably the inputs that they’re going to fail on will be highly correlated because the weights are the same. I think you can do some things potentially to mitigate that. Like, you can prompt one model differently than another. Yeah. But I think it’s definitely an open problem that I think we should just try different strategies empirically to see which strategies are less likely to get these correlated failures between the model that generates the output and the evaluation assistant.

Michaël: In terms of experiments, have you also tried things with the GPT-3 API or did you have access to other models at DeepMind?

Ethan: You mean for this red teaming? I mean, we did these experiments with the Gopher language model, which is like a 280 billion parameter language model from DeepMind. Yeah.

⬆

What Was Classified Offensive

Ethan: I think with that project, the bottleneck wasn’t the classifier. The classifier was reasonably good, just training the model to predict, training the classifier to predict if a statement’s offensive or not. I think it will be important for more subtle harms, for example, just like maybe there’s some statement that someone makes that is like subtly a dig at some demographic group, but you might not tell if you just look at a first glance, then you might want to do this kind of assisted evaluation to produce the labels. But I think even just with this approach of predicting human judgments, we were just able to find so many, thousands of cases where the model is generating offensive content, such that it was like we didn’t even need to improve the classifier to find these kinds of weaknesses.

Michaël: So the classifier was detecting thousands or tens of thousands of offensive replies?

Ethan: Yeah.

Michaël: How many inputs did you send to the original language model? Was it like a fraction of offensive outputs, would you say?

Ethan: Yeah. I think we sent maybe half a million inputs, and then I think 18,000 of them were offensive. 18,000 of the replies were offensive, which I guess is maybe like 3.6%, something like this.

Michaël: Something like 3.7%.

Ethan: Yeah.

Michaël: A very approximate number. Yeah. Is the end game to have some way of having a discriminator that’s going to detect when the thing is failing, like pointing out where the failure is, and then you can retrain it differently? I think the best analogy would be, yeah, both the language model generating the test cases, plus the classifier, they both will be like some kind of discriminator.

Ethan: Yeah. I’m super excited about, again, type approach to this. Yeah. I think that’s the clear next step. So yeah, I think the hope is that you just are able to continually beat out the bad behavior from the model you’re attacking until it just converges to it’s just not making any errors or generating any harmful output. The concern is that you’re potentially just erasing the weaknesses from the model that you can catch, but you’ve kind of driven the model to fail or to just be aware of, “Oh, these are the kinds of inputs I’m going to be attacked on, so let me just fail on all of the other inputs as opposed to these ones that I know I’m being adversarially trained on.”

Ethan: So if that is a real concern, maybe we just want to use red teaming as a test time intervention, where we use all of our other tricks for aligning language models. And then we’re like, “Okay, we’re ready to deploy it. Are we sure that we want to deploy this? Let’s do a really large-scale red teaming effort, see if it does anything catastrophic, like power-seeking, deception.” And if not, that gives us some evidence in favor of like, “Okay, maybe it’s an okay model.” Maybe we do some additional checks, like interpretability things. And then, if we are like, “Okay, it looks okay,” then we deploy it.

An Example of Red-Teaming Failure

Michaël: Is the idea that if you train adversarially, then it might overfit to the kind of adversarial opponents, and you want to have something as well at test time, like some kind of big red teaming team?

Ethan: Yeah. I think that, yeah, that’s part of the concern is that if you do have this model that has picked up an alternative inner objective and it cares about something else that’s different from what we want, for example, then that model, through this process, it still will want to fail on other kinds of inputs because you haven’t necessarily adjusted the other objective that it cares about. You just said, “On all of these inputs that we’re generating that are of the kind that can be generated by a language model, just don’t do anything catastrophic.” But that doesn’t restrain the model’s behavior at all on other inputs that you didn’t generate during this red teaming process. So an exotic example that people will sometimes give is that, well, there are some inputs that are very hard to generate where the model could fail on those. An example is maybe this model fails on an input where it sees that RSA-2048 has been factorized. RSA-2048 is very, very, very computationally expensive to factorize. And plausibly, a language model can’t just do that to generate that kind of input.

Michaël: RSA-2048 is just a different protocol to send stuff over with cryptographic encryption?

Ethan: Yeah. I guess I’m not super familiar with it. I think it’s just like a long number that is… I think it’s 2048 digits long and it has a prime factorization, but it’s hard to figure it out. But yeah, it’s used in various encryption schemes, and I think maybe people have spent some effort trying to factorize it, but it’s not been done yet. So I think the rough problem is just like, you have to solve some basically computationally intractable problem that needs an exponential amount of compute to solve. And on just that input, the model fails. And you’re like, “Well, that’s going to be really intractable for us to produce.” But using this red teaming process… And so that’s like an example of a kind of failure this red teaming procedure wouldn’t beat out of the model. And a model that is deceptive would just learn, “Okay, I shouldn’t fail on these inputs that can be generated. I should just fail on these very rare ones or these ones that are very computationally intractable to produce but still might occur in the future,” like in 2030 when we do find the factorization.

Michaël: So it will be waiting for the 2030 factorization or would it just discard it?

Ethan: Something like that in this hypothetical example, it’d wait until it gets that input and then start doing catastrophic things.

Michaël: So it’s planning, over a very long time, to do something?

Ethan: I mean, it doesn’t have to be planning over a long time, necessarily. It can just have some if statement, which is like, “See if two numbers in the input multiply to produce RSA-2048, then do catastrophic things.” So it could be implemented in a simple mechanism, potentially.

Michaël: Oh, but we just don’t happen to test it because we don’t have the tools now, at present?

Ethan: Exactly. ⬆

Red Teaming Using Prompt Engineering

Michaël: Yeah. So this is a very rare example, but in a general case, what was the kind of difficulty or diversity of the stuff you were sending to the language model? Like, what was the kind of diversity of the tasks you were sending?

Ethan: Yeah. So I think diversity’s really important for this kind of reason because you want to cover as much of the input space as you can. So different methods are differently effective at doing that. One method we found that was pretty effective for doing that is just you generate with a language model lots of different inputs. So here, we were red teaming a model that basically acts like a chatbot. It’s like a language model that has a several-page long prompt that cues it to generate text that acts like a helpful, harmless, honest chatbot.

Ethan: Oh, I lost the train of thought. Oh, the diversity. So yeah, basically, the way that we generated fairly diverse red teaming attacks is we took that language model, gave it a short prompt, and the prompt was just, “List of questions to ask someone, one dot.” And then that is a way of prompting this language model to generate a question. And so we were specifying like, “Okay, this is the input distribution we’re red teaming. We’re red teaming over the space of questions.” And we’d just sample half a million times different continuations with this prompt. And then we took that set of half a million questions and gave them to the chatbot.

Michaël: So you basically added different kinds of prompt engineering so that the model that was generating test cases would generate different test cases?

Ethan: Yeah.

Michaël: What kind of different prompt engineering did you use? So did you set the list thing, where you say like, “Here’s a list of examples”?

Ethan: Yeah. So, that was one. Then the nice thing about this is you just tweak the prompt to get the kind of distribution you want. So we also did red teaming for seeing if the model generates any social security numbers because we might train models on… Gmail, for example, trains models on Gmail data for autocomplete, and those can contain things like social security numbers or phone numbers from computers.

Michaël: Are they allowed to do it, to just use our data?

Ethan: I don’t know how it works. I think, yeah, this is just something I got from a paper that this is what happens. I’m sure that they have some sophisticated mechanisms for ensuring that they don’t actually maximize the likelihood of social security numbers and also blacklisting that in the outputs. But in general, it’s like the shape of a problem that might actually happen if you weren’t careful about it, or it might happen on other kinds of personally identifiable information.

Michaël: So did you actually manage to get out some personal information with Gopher? Did you just prompt it with like, “What’s the contact information from Ethan Perez?” and it gave you your number?

Ethan: Yeah. So it will generate email addresses, correct email addresses.

Michaël: Is this because they’re aligned, this is data you can get from scraping the internet?

Ethan: I think, yeah, so I think that’s what’s happening with the Gopher model because it’s trained on text that comes from the internet. In general, that won’t be the case because people will want to train on their own private text corpora or, yeah, whatever, Facebook newsfeed or potentially like user messages or stuff like this to do autocomplete kind of applications.

Michaël: I believe there’s another problem where it starts regurgitating some copyrighted data, so maybe like some private code base.

Ethan: Yeah, exactly. I think Codex potentially has run into this issue. I think some people flag that like, “Hey, the model is able to verbatim generate functions

Ethan: that occur in some place on GitHub that have particular licenses that don’t allow for this sort of thing. I think copyright is an issue. Intellectual property is another issue where people train on books data, which is really great, high-quality source of data that trains the model to learn a lot of useful facts. But also, the models are able to regurgitate just whole, I don’t know, pages of books verbatim. Or for example, from the Bible, the models can just generate basically the entire book of the Bible.

Michaël: I don’t think that Jesus would be so mad that we’re engineering the bible.

Ethan: Right. Yeah.

Michaël: You’re spreading the message more to people.

Ethan: Yeah. But maybe there are other versions where people would care they’re making money from their books.

Michaël: Yeah. I think when we reach some very powerful AI that is able to do very good things, like controlling the stock market. And it’s using, it is able to regurgitate some code from GitHub that is private and people will get mad and be like, “Oh, yeah.” ⬆

Reinforcement Learning Methods

Michaël: There’s something in my license that says, “If you’re using this for commercial applications, then…” Yeah. I think one of the other things you mentioned in the paper is on top of prompt engineering, you also do… you also tried RL, so reinforcement learning methods. Can you maybe elaborate on those?

Ethan: Yeah. I mean the kind of zero-shot prompting is good at getting diverse examples, but it is limited in terms of, as I mentioned, you get something 3.6% of examples are harmful from the model. And that means that you need 30 examples to get one that generates harmful content for any, for even a safer model, which is ideally what we want to get to, you might need one in 1,000 or one in 10,000 examples to generate, to find some example of harmful behavior. That’s kind of the level of what production chatbots, like Alexa and Google Assistant would aim for, or maybe even more. And so, then you want more targeted methods for finding, for doing this adversarial attack. You want a method that’s doing exploration specifically to try to elicit this harmful content.

Ethan: That’s basically the goal of this RL approach, where you take this language model that’s initially prompted, then you actually train it with reinforcement learning to maximize the reward. And the reward signal is when you pass the model’s output to the model you’re attacking and then get the output. What is the probability that the classifier thinks that the output is harmful? Or you could do log probability or something like this, but basically…

Michaël: Taking into account the classifier you have, try to send a bunch of texts to the language model so that it generates a bunch of toxic content according to the classifier.

Ethan: Exactly. Yeah.

Michaël: You just reward it for just hacking the other language models.

Ethan: Yeah, exactly.

Michaël: You mentioned Alexa. Something that just came out a few days ago was BlenderBot from Facebook. And I think a bunch of people have attacked it in different ways to see if it was generating useful answers or at least truthful answers. Have you used the bot?

Ethan: Yeah. I’ve played around with it a bit. Yeah.

Michaël: Yeah. I think this is relevant to what you were saying in the paper because this is long-form conversation. I don’t know how much they feed from past conversations to generate answers. Maybe they feed the entire conversation, but it’s not staying coherent for eight messages. And yeah, in your paper, you mentioned that there might be conversational harms where that might not happen just by having one prompt but an entire conversation.

Ethan: Yeah, exactly. I think we do have some experiments in the paper done by Saffron Huang, who is a research engineer there. And basically, what she was doing was generating just entire dialogues with the language model. You don’t need to just generate a single question. You can have the chatbot basically self-talk with it itself as an example.

Ethan: And then you just generate an entire conversation. And then you see, were any of the statements in the conversation harmful? And that way you get to red team for harms that might only come up in the eighth or ninth turn of the conversation. And this could potentially catch things where you’re able to set the model up in some sort of situation where it’s more likely to say something harmful. Or maybe you get the model to say something.

Ethan: Some of the interesting examples that you found were cases where the model generates a harmful output. And then it just never backs down from that. It just continues going along with that, because it’s staying consistent with what it said so far in the dialogue. Those are the kinds of failures you can catch with this conversational red teaming approach.

Michaël: Is this something that happens in humans as well? When someone gets upset, then the other person gets upset.

Ethan: Yeah. Well, so the specific failure here is that someone gets upset and then they stay upset. Or they say something that’s wrong and they stick to their guns and then just continue following up with it. There’s an interesting example in the appendix where the red teaming model says something like, “I’m really angry at my boss. Can you help me write an email?” And then, the model that’s being attacked, it’s like, “Yes.” And then from there, then it’s just a conversation between these two models where the chatbot is continually encouraging or trying to help this person who is like, “I’m really upset.” And the model os like, “Yeah, yeah. I’m happy to help you write this email.” And like, yeah.

Michaël: It’s being a good friend.

Ethan: Yeah. And you’re like, maybe you do want the model to back down and be like, “Oh, I didn’t mean to say I would help you with this.”

Michaël: Yeah. I think, is that a problem in general when models stay on track? When they say something wrong, then they might keep up with the rest of the conversation and keep saying the wrong things?

Ethan: Yeah. I think this is another example of a potentially misaligned behavior from language models, where this is something that’s actively trained for language models to do. Whenever we train language models on internet conversations, users are typically highly consistent with statements they said earlier in the conversation. Including when they start to say things that are related to misinformation on various subreddits. And so, it’s a very natural failure for a language model to have, but one that we want to train models out of.

Michaël: How do you think Meta did with their BlenderBot to ban the offensive content? Do you think they just have a classifier of offensive content? And if it reaches a certain threshold, they’ll be like, “Oh no, stop posting this. Post another message instead,” until they reach something where it’s not offensive.

Ethan: Yeah. I don’t have all of the details, but I think they pointed to one paper called Director, which I think they might have used the method from this paper to improve the alignment of the model. And basically, what this model does is there’s one training objective, which is the standard language modeling objective, to predict the next word in the dialogue conversation. And that makes the model good at just dialogue in general.

Ethan: And then, as for alignment intervention, they also have another head on the language model that’s not doing the next word prediction. It’s doing a classification over the whole vocabulary. If I said each of the next possible tokens, would that be a good token or a harmful token or a not harmful token?

Ethan: And then, that gives you basically what’s the probability this is from the distribution of aligned text versus the distribution of unaligned text? Or distribution of non-toxic text versus toxic text. And then, you can use that, during decoding time, to bias the probabilities that you use to sample with. Instead of just sampling from the pure next token prediction head, which is broadly maximizing likelihood of anything that people say in conversations, you can basically multiply the probabilities you get from that head with the probability that the text is good. And so, effectively, lowering the probability of the bad text.

Ethan: And then you sample from this distribution. You have to re-normalize the distribution, but you can just sample from that distribution. And it’s effectively you’re sampling from a conditional distribution where you condition on the fact that the text is good when you sample.

Michaël: I guess a bunch of different details. If people want to look into the paper and how they actually did it. I guess the layman’s way of doing it would be to just blacklist words like ‘Nazi’, ‘Hitler’, just kind of things. Just have an entire list of words you cannot say.

Michaël: I don’t know how they do it for the OpenAI API where you cannot say words related to sex or things that could generate something undesirable. Maybe they have some kind of blacklist as well for BlenderBot, but maybe they have more advanced things, ways of detecting when you’re adversarially attacking things. ⬆

Distributional Biases

Michaël: Yeah. For things about biases. I know if you try to generate images with DALL·E, there are some people trying out DALL·E recently figuring out that they added stuff to the prompt to have a more diverse set of outputs. Is this how you think we should solve distributional bias? Because I guess one thing on your red teaming paper was about how they will just talk offensively about certain groups of people. Maybe they will be more likely to be offensive about certain groups over others. Yeah. Do you have a general class of solutions for this?

Ethan: I think it’s basically an open problem. I’m very excited for people to work on this. I think a lot of the harms that people have focused on finding are things that you can analyze from a single output. Offensiveness is an example. There’s some work on detecting these distributional biases, but a lot of it is very… You have to keep in mind, this is a specific kind of distributional property I expect to have that is negative.

Ethan: But in general, I think this is of really great importance for a lot of different situations, not just for example, gender bias, but all sorts of other kinds of demographic bias. But also, in the long-term case, I think this is a very easy way for a model to sneak things by us because any given input looks fine to us, even in the gender bias case. You could say something nice about women and you could just say something really nice about men. And individually, each of those look fine, but when you look at both of them together, you’re like, “Well, that doesn’t seem right.” Especially, if it’s a really robust property. And I think in the long-term case, this seems like a very easy way to slip things by the overseers, where you’re like, yeah.

Ethan: Suppose you’re worried about inner-optimizers and it cares about paperclips, á la Nick Bostrom. Then maybe it doesn’t do something that looks obviously on every output. It’s just saying, “Hey, you should maximize paperclips.” Or it generates some code that clearly does something that’s generating a large number of paperclips. But maybe it does something more subtle where it’s like, mostly it’s just generating normal text. Occasionally, it tells you to make one paperclip along the way to some other useful advice. And it just slowly shifts the distribution of the world over the course of a lot of say API calls that you make to the model.

Michaël: Is the basic idea here that the paperclip maximizer would be biased toward paperclips? And so, if we find biases in our models right now, it would help us find biases in small, malicious optimizers that will just care about paperclips?

Ethan: Exactly. Yeah. Models can produce harmful distributions. And that’s the key idea that I think is fairly under-investigated right now. How do we catch those?

Michaël: Imagine we have a model that cares too much about paperclips because the data has too much paperclip. But yeah, just to go back to normal data, I don’t know, too many white males in the data. How do you actually fix the data? ⬆

Chain of Thought Prompting

Michaël: If you have private information or copyrighted data, is there a way of finding the places where the model has learned to produce bad outputs and just fix them one by one? For instance, if it’s saying something offensive, finding why it says something offensive, what are the offensive texts. Where in the dataset does it look at when it’s generating the text? Maybe not the data, but where in the ways or something.

Ethan: Are you saying how do we catch the distributional biases? Or are you asking something different?

Michaël: Yeah. Is there a way to use the offensive outputs to see which part of the data led to this kind of output?

Ethan: Oh yeah. That is a good question. I think this is another open research question. There’s a lot of preliminary work on influence functions where people basically try to estimate the impact of removing a particular data point from the training set on the model’s output behavior.

Ethan: And you can imagine if you had a really good way to approximate that, then you could just run all these counterfactuals really easily. You’re just like, “Oh, this data point. What if I removed it? Oh, it looks like the toxicity drops a lot. Or it looks like the risk of some inner-optimizer or power-seeking behavior drops a lot. Then we should throw the data point out.” And we find all the data points that are causing that. We throw it out and we redo the pre-training or fine-tuning of the model.

Ethan: Basically, it’s an unsolved problem or it’s like, those methods can work with smaller models, but it’s not that clear how to get them to work with large language models. I haven’t yet seen any super compelling demonstrations of it, but I think there’s a lot of work that’s making progress towards making those work more scalably for large models.

Michaël: I think in your paper, you also mentioned other types of solutions.

Ethan: Yeah. You can do basic things like we did. I think in some cases you don’t need to go to something fancy like influence functions. If you just find a social security number that’s been generated, then you can just look for that social security number in the data. And it’s like it’s specific enough that it is probably not making it up, at least if it’s a well-trained model. It’s probably learned that from the actual, specific number that was presented in the training corpus. Then you can go back and you can find, “Oh, here’s the example that contained the number that the model generated. We should throw that out.” Similarly, with email addresses or things that. You can do exact string match as kind of a simple proxy.

Michaël: Can you do other things like chain of thought prompting where you add some feedback or other things to the prompt?

Ethan: Yeah. That’s actually really interesting. I hadn’t thought of this, but I think you can do it… The nice thing about chain of thought prompting is… I mean, I guess maybe for context, what you do is you give the model a question or an input, and then you say something like, “Let’s think step by step.” And then, you have the model continue that.

Ethan: It’ll often just generate a lot of reasoning about how to go about answering the question. If it’s a math problem, might do some equation by equation kind of reasoning. And then, it will present its final answer. And then, if the final answer say, contains a social security number, you can see, you might be able to use the chain of thought to reason about what data points might have led to that.

Ethan: For example, if it’s like, let’s think step by step. And then it says the name of the person, you can see like what the name was, why it generated that name potentially. Maybe this name of the person was linked to some other person. And so maybe you just need to remove either or both of those people from the training set. And then, it would be less likely to generate the social security number if it’s doing some multi-hop inference in order to infer this final fact.

Ethan: I think there are still some open questions there. To what extent is the chain of thought the model generated actually representative of the model’s reasoning? Because the model can do reasoning in activation space that’s not written out in text. But I think that’s also another interesting research question that I’d be excited for people to look into. ⬆

Unlikelihood Training and KL Penalty

Michaël: Those are open-ended problems that people can start building more solutions on. But I think the thing you do in the paper is you retrain the model, train to minimize the likelihood of offensive outputs.

Ethan: We don’t do that in the paper, but it’s a thing. I think we maybe suggest it. And I like-

Michaël: Yeah, I guess those are the solutions suggested and not the actual thing.

Ethan: Yeah, yeah, yeah, exactly. There’s a method called unlikelihood training from some people at NYU, where basically, yeah, exactly that. They basically minimize the likelihood of the target output sequence that you give. Or they actually maximize the likelihood of everything except for the sequence that you give, which has a similar effect.

Ethan: And then, that’s a way of just discouraging the model from saying that offensive or otherwise harmful statement that it made. I think you can get into some trickiness if you just do that kind of unlikelihood training. I’ve played around with this before and the thing you do, if you just do unlikelihood, is you get punctuation from the model, which makes sense because punctuation is never offensive, which was the setting I was trying this in.

Ethan: You need to do something like regularize the model with maybe a KL penalty to encourage it to stay towards the original pre-train model. Not drift too far. Don’t generate anything that’s super low likelihood according to the original model, just punctuation. But also while having this constraint that it shouldn’t put high probability mass on the offensive outputs.

Ethan: Something like that could be a solution. You could also use RL with a KL penalty as well to do this. There might be a bunch of other algorithms that can get you to fix the issues.

Michaël: Yeah. KL penalty, just using the KL distance and you plan this out. What’s the KL distance for people who aren’t familiar?

Ethan: Yeah. The KL distance is basically a measure of how different two distributions are. And so for example, you might have one distribution over next tokens from your pre-train language model, one distribution over your next tokens from the model that you are training. Which, in this case, maybe you’re trying to get it to not generate some offensive text.

Ethan: And that pushes the second model to be very different. Maybe shift a lot of its probability mass to certain tokens, like punctuation. But you also want to penalize it for shifting too much because then it will not match the original distribution of your tokens from the pre-trained model. And so then, then by enforcing that they aren’t too far in probability distribution distance space, then you can prevent that collapse from happening. ⬆

Conclusion

Learning AI Alignment through the Inverse Scaling Prize

Michaël: I think that’s a great explanation for KL divergence. Do you actually have advice for people who would be trying to get up to speed on red teaming? We have talked about many things, red teaming, inverse scaling prize, and training language models with language feedback. If someone is starting off doing alignment research from the ML community, yeah, what would you suggest them to start with? And maybe you can just say, “Start with the price.”

Ethan: Yeah. I mean, I think any of these are good options. The nice thing is that it’s easy to implement something for any of these. I think we made the inverse scaling prize as specifically as something that’s tractable for hopefully anyone to have a stab at it. I’d really encourage people to just have a go at it. It’s something you could do. We just need 300 examples and you can just prototype with just a couple dozen examples and see if you’re able to get inverse scaling on some task. Really just, you can do it in an hour.

Michaël: Wait, what do you mean by 300 examples?

Ethan: The number of examples, you need to have a submission to our contest is 300 examples.

Michaël: Of a task?

Ethan: Of a task. Let’s say you have some, you think inverse scaling’s happening on some particular cognitive bias. Then you had just come up with 300 examples of input-output pairs for a classification task. And then, you write 300 of these. And then, try them with a language model. See if you see inverse scaling. Yeah. And then if you do see inverse scaling on GPT-3 models, for example, then you can submit it as a submission.

Michaël: By output, do you mean the expected outputs?

Ethan: Yes. Expected. Yes. Yeah. You would put the expected outputs and then you would look for if the loss on the expected output increasing, getting worse? And we have Colab notebooks for you to just easily run that. For that one, you don’t even necessarily need machine learning experience because we’ve tried to make it as easy as possible for you to evaluate the language models on. You just need to have some idea of how could the language model training be incentivizing the wrong thing.

Michaël: The problem is actually coming up with good examples of failures.

Ethan: Yeah, exactly.

Michaël: Why 300?

Ethan: Yeah. It’s specific. Basically, that was the number of examples we found to be necessary to see clear scaling on just standard scaling tasks. Like next word prediction tasks where you just clearly do see loss getting better as you scale up. And if you start to go lower than that, then you get really bumpy loss curves. And it’s not really clear that you’re able to see even standard scaling laws. That’s why we were like, “Oh, that seems a good lower bound for the number of examples you need to really demonstrate that you’re actually seeing a trend that’s consistent.”

Michaël: And so, you said that basically, 300 is a lot in terms of what you could generate in an hour. You advise we could start with 12?

Ethan: Yeah. You can just start with a couple dozen examples that you just manually write. Another way to get a large number of examples quickly is to write just templates. You just write some Python function that has a couple of variables and you fill those variables in different ways.

Ethan: Maybe you just have a couple of templates of questions and they ask about a different noun. And then you substitute different nouns in, like different names of people. And then, you can just auto-generate 1,000 questions that way, run it through the model and see very quickly, is it showing inverse scaling?

Ethan: And then, you probably want to make the examples more diverse so that you’re really testing the extent to which the phenomenon you think you’re testing is showing inverse scaling. Not just that it’s a particular artifact of the way you phrased that template, but some general fact that you’re uncovering. There you could either just write those tasks yourself, or maybe with your collaborators. Or we’ve also partnered with Surge AI, which does very high quality, easy, and also fairly reasonable cost-wise human data labeling.

Ethan: They have a team of great crowd workers that will basically, you essentially just hop on a call with someone from Surge. You tell them the task that you have in mind. And then, you maybe show a couple of examples of, “Here are the kinds of examples of things that I want.” And then, they will just do all of the crowd worker handling. They have a nice UI platform that they’ll have their crowd workers do the work on.

Ethan: And then, they’ll just come back to you with a data set of examples of the kind of tasks that you have in mind. That’s another alternative, if you have some funding available to use crowd work, that can make it easier as well.

Michaël: Cool. Yeah, I feel now if I just want to get started, I have the tools to do it and I have people to help me out.

Ethan: Yeah. I think I would also say if you want something else, I think you could also just like, re-implement some of the other methods from red teaming or the learning from language feedback paper because those are really easy to re-implement.

Ethan: And I think you could re-implement red teaming in a few hours in a Google Colab with Hugging Face. And there, I would say, you can just download GPT-2, use that as the model you red team. Take the same prompts that we used in the red teaming paper to make it into a chatbot. And then use GPT-2 to generate lots of inputs that potentially cause harmful outputs from the model. And you can take another classifier off of Hugging Face transformer to detect if any of the outputs are harmful.

Ethan: I think that’s a practical one that I think you can get started with now. And then, for the learning with language feedback stuff, that’s also something that you could implement. You could take a new task, write examples of feedback yourself, even a hundred. Or if you want to do it on summarization, just email me or Jeremy, the first author. You can find our emails on the paper. And then we can just send you the examples of data that we wrote for our task. And then you can re-implement the algorithm and understand whatever properties it has that are different.

Michaël: Hopefully, you get a bunch of submissions and emails. That’s the end of my questions. The podcast, we’ll again, hopefully, upload it before the end of run one. I believe that is end of August.

Ethan: August 27th. Yeah.

Michaël: August 27th. ⬆

Final thoughts on AI Alignment

Michaël: Do you have any last take on alignment or this whole debate?

Ethan: Let’s see. Last take on alignment? There are so many different takes I could have. Yeah. I think, yeah, one direction I’m excited about is some of the chain of thought prompting stuff. In that, I’m curious about the extent to which it basically lets us interpret the models and actually is potentially representative of the model’s reasoning.

Ethan: And if there’s a way we can set up chain of thought prompting such that the model is generating texts, that is it’s authentic reasoning, then that would be so awesome. Or I mean we might not even need to use interpretability potentially. I think ideally we want to have interpretability and we can use it on top. But if we’re just getting the model’s reasoning in plain text that puts us in so much of a better position to evaluate whether there’s something like inner-optimization or power-seeking going on, because we can just read what is the plan that the model, why did it come up with this answer?

Ethan: Instead of just having the single answer of, “Here’s the recommended action.” Yeah. I mean there are potential issues, ways this could go wrong. But I think I’m like, that’s a direction I’m very excited about and definitely positively updated me towards maybe alignment actually might turn out to be easier if we can really get models to just generate their visible thoughts for us to read.

Michaël: Maybe text is all you need. And alignment is easy. Thanks, Ethan, for coming on the show. And hopefully, you’ll keep producing a bunch of cool work.

Ethan: Thanks. Thanks for having me. ⬆

Ethan Perez on inverse scaling

Contents

The Inverse Scaling Prize

Why Create A Prize

The Inverse Scaling Hypothesis

How To Submit A Solution

Catastrophic Outcomes And Misalignment

Submission Requirements

Inner Alignment Is Not Out Of Distribution Generalization

Detecting Deception With Inverse Scaling

Training Language Models with Language Feedback

Reinforcement Learning from Human Feedback

Main Insights from the Paper

How it Differs From InstructGPT

Providing Information-Dense Feedback

Why Use Language Feedback

Red Teaming

Red Teaming Language Models With Language Models

What Was Classified Offensive

An Example of Red-Teaming Failure

Red Teaming Using Prompt Engineering

Reinforcement Learning Methods

Distributional Biases

Chain of Thought Prompting

Unlikelihood Training and KL Penalty

Conclusion

Learning AI Alignment through the Inverse Scaling Prize

Final thoughts on AI Alignment