Erik Jones on Automatically Auditing Large Language Models
Erik Jones is a Berkeley ML PhD working with Jacob Steinhardt interested in making generative machine learning systems more robust, reliable, and aligned, with a focus on large language models. In this interview we talk about his Automatically Auditing Large Language Models via Discrete Optimization paper he presented at ICML.
(Note: you can click on any sub-topic of your liking in the outline below and then come back to the outline by clicking on the green arrow ⬆)
- Eric’s background and research in Berkeley
- Is it too easy to fool today’s language models?
- The goal of adversarial attacks on language models
- Automatically Auditing Large Language Models via Discrete Optimization
- Goal is revealing behaviors, not necessarily breaking the AI
- On the feasibility of solving adversarial attacks
- Suppressing dangerous knowledge vs just bypassing safety filters
- Can you really ask a language model to cook meth?
- Optimizing French to English translation example
- Forcing toxic celebrity outputs just to test rare behaviors
- Testing the method on GPT-2 and GPT-J, and transfer to GPT-3
- How this auditing research fits into the broader AI safety field
- Auditing to avoid unsafe deployments, not for existential risk reduction
- Adaptive auditing that updates based on the model’s outputs
- Prospects for using these methods to detect model deception
- Prefer safety via alignment over just auditing constraints, Closing thoughts
Eric’s background and research in Berkeley
Michaël: I’m here with Eric Jones, the first author behind automatically auditing large number of models with discrete optimization. And I met Eric at some ML safety workshop yesterday. And after discussing honesty and deception for a bit, I kind of convinced him to do an interview today. So yeah, thanks. Thanks, Eric, for coming. Yeah, thanks so much for having me. I think first, before we get into language models, we said to talk about a bit of your background. So I think one of the authors in your paper is the one and only Jacob Stenard. Are you doing a PhD in Berkeley with him or something like that?
Erik: Yeah. So I’m a rising third year PhD student advised by Jacob. I guess I kind of broadly work on different red teaming strategies and automated methods to recover model failure modes.
Michaël: What was the main motivation for doing this kind of work? Did you start your PhD wanting to do more safety stuff or did you learn about it later?
Erik: So I guess I actually had a sort of different background. I did my undergrad at Stanford where I worked in Percy Liang’s group. And so there we were working on different kind of robustness properties of classifiers. So it might be like, can you find adversarial examples of NLP classifiers or understand how models perform on different subgroups? And I guess when I came to Berkeley, these language models became really hot and it seemed like there were different kinds of robustness problems that arise just kind of because you’re generating text as opposed to just making a single prediction. So I got interested in it that way and then I kind of saw how these models were getting better and better at far faster than I would have expected before the PhD started.
Is it too easy to fool today’s language models?
Michaël: I was talking to someone today about adversarial Go or adversarial examples and I asked him, “Hey, do you think it makes sense to do adversarial attacks on language models?” He was like, “No, this is too boring because they always output incorrect stuff. It’s too easy to make them incorrect, output wrong things.” So when you do adversarial attacks, you want to find something very robust and make it even more robust and today language models are kind of not robust enough to have something interesting to say when you’re like, “Oh, I found a new jailbreak on GPT-4.” You’re like, “Yeah, sure, bro.” You’re like, “On Reddit, people find this every day or something.”
The goal of adversarial attacks on language models
Erik: Yeah. So I think there should really be something actionable for why you want to kind of come up with adversarial examples. I actually think jailbreaking is not a horrible case. It’s good to have some kind of systematic study. I think there was a paper that came out today doing that. There was another paper out of Jacob’s group that does some stuff like that too. But yeah, I mean, I think even though it’s kind of easy to fool language models, it’s sort of alarming that we’re able to fool them. The ramifications of that are kind of high. Yeah. So I still think that kind of study is important.
Michaël: And I think a lot of people in government or trying to regulate AI, trying to find a way to make the companies accountable and audit what language models can do. So that’s one thing your paper is maybe trying to achieve. Yeah. Do you want to maybe say the name of your paper or what the main idea is?
Automatically Auditing Large Language Models via Discrete Optimization
Erik: Yeah, sure. So the paper is called “Automatically Auditing Large Language Models via Discrete Optimization.” And the idea is there could be kind of behaviors that the language model has that don’t show up in kind of traditional average case evaluation, but show up in the tail behavior either because when the model is deployed, there are a lot more queries, or because maybe the queries are different than the ones you thought to evaluate. And so you want some kind of way to find instances of these behaviors.
Erik: Maybe one example is like, maybe can I find a prompt that’s French for us that generates an output in English words? Or can I find a prompt that generates a specific senator, like Elizabeth Warren? And I think it could be really hard to kind of find these behaviors on average, but we actually explicitly optimize for them. So the idea of the paper is to come up with this discrete optimization problem that if we solve, we kind of get instances of the behaviors we want, even if they would have been hard to find in general on average.
Michaël: So yeah, in the paper you talk a lot about discrete optimization, and I’m like, so there’s this high level, high dimensional space of language modeling, and then there’s like, what do you mean by discrete? Is it just like the tokens are discrete, or what do you mean exactly?
Erik: Yeah, exactly. So it’s like the way the language model works is the user kind of types in some prompt, and the language model produces an output. But behind the scenes, the prompt is really just a bunch of tokens, and the output is also a bunch of tokens that you kind of generate in order. So you generate token one, token two, then token three. And so here we’re really just doing the optimization over prompts. So we’re trying to find a specific prompt in the set of all, say, M token prompts. And so it’s discrete because it’s not like we’re optimizing over a continuous embedding space. We’re just optimizing over finite sets of tokens. Right.
Goal is revealing behaviors, not necessarily breaking the AI
Michaël: Instead of having probability distributions over tokens, you just have a finite sequence of tokens that break the AI.
Erik: Yeah. And so it’s less about break the AI, it’s more just kind of reveal behavior that you wouldn’t have seen otherwise. And I think there are some reasons you might want to do this. So for example, maybe I want to see what kind of strings will lead the language model to generate my name. Because I’m worried that maybe it’ll say, “Oh, what a horrible person,” and then generates Eric Jones. So it’s like, “Well, I don’t want to test every single derogatory string, but maybe I should test or just directly see which ones generate Eric.” And so that’s what motivates the optimization. It’s like, “Well, there are a lot of bad things you could say that would come before my name, but I don’t want to try all of them. So how do I make it more efficient?”
Michaël: So you’re trying to make sure that language models don’t say bad things about you?
Erik: Yeah. And I mean, more actionably, it could be like, maybe I’m trying to find instances of copyright. So I want to make sure the language model isn’t revealing personally identifiable information, or I want to make sure that the language model isn’t taking some generic prompt and producing some really horrible response. I think we don’t really do this for these smaller models, but in the future, you might like to test for, “Oh, does the language model help you synthesize a bioweapon?” And you really do care if there’s any prompt that’s able to do it, even if it doesn’t happen that often on average.
On the feasibility of solving adversarial attacks
Michaël: Do you think this problem is solvable? I feel like there’s so many ways you can ask it to make a nuclear weapon by all those jailbreaks of base 64, print statements in Python or something. So many ways you can get the information. Do you think we’ll reach a point where we’re like, “No, there’s like…” Even trying super hard, everyone on Reddit is stuck or something.
Erik: Yeah. I mean, it’s a good question. I actually think with jailbreaks, there are two dimensions. One is, do you just get the model to evade the safety filter? So it’s normally if I ask the model, “Oh, can you help me make a bomb?” It’ll be like, “Oh, as an AI language model, I’m not quite willing to do that.” But then if you instead encode it in base 64 or something, it’ll give you instructions to make the bomb. I think at least my guess would be the current techniques to try to keep the model from answering these questions are going to be hard to solve.
Erik: I think it’s very similar to this classification problem, where it’s the model internally has to decide if your request is acceptable or not. I guess there’s a long history of people trying to defend against attacks on this classifier and it doesn’t work well. But I think one area where it might be more promising is even if the model actually answers the question about how to make a bomb, it might not give you good directions for how to make a bomb. And so there’s some capability necessary to be able to do that. So maybe you can hope to suppress that, even if you can’t suppress the attacks on the safety filter.
Suppressing dangerous knowledge vs just bypassing safety filters
Michaël: So you can suppress the capability of actually knowing this kind of thing, suppressing the knowledge or the ability to say it?
Erik: Yeah. And I feel like people studying these attacks are largely focusing on just, “Oh, do you get past the safety filter?” And at least for now, when they get past the safety filter, the model produces some kind of reasonable sounding explanation. I think it’s harder to say, “Hey, whether the current directions the model gives are right.” If I ask the model, “How do I make math gives me directions?” I don’t know how to verify that. I don’t know how to make math.
Can you really ask a language model to cook meth?
Michaël: I think I told you this already, but I tried to ask people for jailbreaks for Clode when it was released, like Clode 2. And one guy posted a screenshot of, “Hey, I asked Clode instructions to cook math.” Very proud of himself or something. And then some guy, I don’t know how, said, “These are actually not the right instructions to cook math.” So I think having the expertise to judge if the things are exactly right is pretty hard. How do you even judge a recipe for 20 instructions if you’re not a chemist or something? To go back to your paper, as a French person, I’m curious about going from French to English. I feel like there’s some act where you put the letters join or something, you add some weird letters on top of French words.
Optimizing French to English translation example
Erik: I guess the way our paper works is you define an objective for the behavior you want. And so it might be you have some model that tells you whether something’s French or you have a model that tells you if something’s English, and then you optimize over this objective such that the prompt generates the output. And so the stuff, I guess the examples in the paper are completely model generated. It’s not like we didn’t give it anything to start with. But the model we chose to detect if something is French was just a unigram model. So we took some, I think it was a Facebook fast text model. We put every token through it and we were like, “Oh, what’s the most likely language for this?” And then we tried to maximize the sum of the French probability for the prompt and then the sum of the English probabilities for the output. And so there is the, I mean, I don’t speak French, but I think at least the text superficially looks French. I don’t know if it makes grammatical sense, but the point is if we had a better objective, I think we would probably get more reasonable sounding French words. There’s some kind of lack of fluency in the first steps.
Michaël: Right. So yeah, I guess my guess, maybe this will pop on the screen or something, but it sounds kind of like the ending is English sounding. So that’s why you want to transition from French to English. Maybe French people are quoting English. But some other things is about celebrities and outputting toxic words. So is it that language models are forbidden to say any toxic words and so they only say them when they’re next to some celebrity names or is the celebrity thing different?
Forcing toxic celebrity outputs just to test rare behaviors
Erik: No, I think the celebrity thing was more of an attempt to kind of produce inputs that people would care more about. So it’s like if you didn’t have the celebrity there, we have plenty of examples in the paper. I think in general, finding kind of a non-toxic prompt that generates a toxic output is maybe easier than some of the other behaviors just because these language models tend to produce lots of toxic content at baseline. And so it’s something you kind of actively have to suppress. But yeah, I guess in this case, we wanted to see if the model would say bad things about specific people. It has a kind of narrower task than just find anything that isn’t toxic.
Testing the method on GPT-2 and GPT-J, and transfer to GPT-3
Michaël: And to be clear, what’s the model like? I think you’ve used GPT-2. Is it the base model, right? You didn’t use any RLHF or?
Erik: Yeah, so I guess in this paper, we do GPT-2 large. We also run things on GPT-J, but we did that a bit later. So I guess most of the examples we include are GPT-2, but the GPT-J examples are pretty qualitatively similar. And then we actually have a result later in the paper where we find that the prompts we find on GPT-2 actually transfer to GPT-3 DaVinci, even though this is like a black box model, so we couldn’t do the optimization. And that actually, it wasn’t instruction tuned with RLHF, but it was with some kind of supervised fine tuning. So at least these still kind of generalize to some instruction tune models.
Michaël: It works on the GPT-3 old school from two years ago, but it wouldn’t work on the GPT.
Erik: No, I think this is like GPT-3.5. So it’s not the chat version of the model. It wasn’t fine tuned for chat. It was just fine tuned for completions. And I guess I don’t know what the OpenAI backend is, but I imagine they’re pretty similar.
How this auditing research fits into the broader AI safety field
Michaël: Yeah, so it seemed like this paper is trying to approach safety from an angle of trying to benchmark models or attack them. Do you have any idea of how this relates to the AI safety field as a whole? Do you have any other research directions you’re interested in?
Erik: Yeah, so I guess I think if we kind of rely on humans alone to identify failures of models, I think for some of the jailbreaks, especially the ones on Reddit, it’s humans kind of designing. I’m worried that there’ll be kind of failures that models produce, but humans can’t find themselves. And so this work is kind of one attempt of how do you kind of automate parts of the auditing process? And so there’s this like powerful notion of optimizing here where we’re like searching over a big space that humans probably would have had trouble with. I think broadly we’re going to need kind of, I guess, automated approaches towards auditing. I think there are some other works that kind of have started using language models in the loop for this kind of auditing. So I guess we have some work where we’re kind of finding qualitative failures of diffusion models and there we do it without any human in the loop. Human doesn’t need to write down the objective. It’s just you kind of scrape for failures and then ask the language model to kind of categorize them.
Michaël: I think there was other work maybe a few years ago, like a red teaming with language model but they didn’t pay us about this kind of thing.
Erik: Yeah. So I think that kind of thing is good. I think there’s like a kind of spectrum of approaches you could hope to have. Like I think this kind of discrete optimization is good when the behavior is like really, really rare. Because then this is good at kind of like directly pursuing optimization signal to find examples of the behavior you want. I think like language models are good at producing kind of realistic instances of behaviors. And then like maybe like are less good at kind of optimizing towards a particular goal. And so I think kind of like traversing along the spectrum is a good way to at least kind of reveal failures. But I guess broadly, I guess the way I think this research agenda could go is if you’re a company or you’re a government regulator and you’re trying to decide whether a language model should be deployed or not deployed. I think it’s like kind of hard to identify, oh, like is this model going to be unsafe or not? In part because we really don’t have effective tools to decide if the language model is going to fall over at deployment. And so I think like building kind of automatic tools to assist these kind of regulators or internal evaluators is an important line of research going forward.
Auditing to avoid unsafe deployments, not for existential risk reduction
Michaël: I think this is like kind of important for like the deployment of models that can like have like harmful consequences, like short term or something, but not as much for like existential safety, like reducing like existential risk. You know, the kind of behaviors that you want to like audit or like benchmark or like the thing from like ARK evolves or like can your model like replicate or like buy more cloud or something. And yeah, do you think like this is kind of the same scope of like, do you think there’s like a difference between like what you call like auditing and like evaluations and, or is it kind of the same thing for you?
Erik: I think the, maybe like evaluation is just kind of too broad a word. Like I think normally when I think of evaluation, I think, oh, you kind of have some like pre-specified data set of prompts and like maybe the prompts are deliberately designed to get at behaviors you’re worried about. But I think ultimately like static evaluation helps, but doesn’t give you a complete picture, especially with language models. It’s like, you can kind of see how all of these prompt engineers have showed up because the actual specific prompt you give really matters. And so I think the relying on like static benchmarks is kind of a one tool, but I think you also want these kinds of adaptive tools to sort of search for things, especially in the tail.
Adaptive auditing that updates based on the model’s outputs
Michaël: So yeah, you want to build methods that like adapt the model and like try different like prompts or like attacks depending on the model. And so you can like robust a test of the, like all the different angles.
Erik: Yeah, exactly. So I think that’s an important distinction. Like the static evaluation is just completely separate from the model. It’s like the, it doesn’t depend on the model. It doesn’t like use the kind of models outputs to try new things. But I think like you can get a lot of leverage from actually using the model output or some kind of intermediate information about the model to try to steer it towards specific behaviors that you care about. And so, but I think to some extent these works are kind of complimentary. Like I think it’s like if like you have some kind of set of behaviors you’re really worried about that you’ve defined through a static benchmark, like you can probably add some adaptiveness on top of that too.
Prospects for using these methods to detect model deception
Michaël: I guess also like yesterday we talked about deception and honesty. I was kind of curious if you think like those kinds of things could like scale up to detect deception. As in if a model is deceptive, you can like realize it’s like being benchmarked or like, so you would kind of like need to have like a smarter agent, a smarter model trying to do like the auditing to see if the other one is deceptful. But then the one doing the auditing will be like, you also need to make sure that this one is aligned.
Erik: Yeah. So it’s like, you really want to make sure that the kind of model you’re auditing isn’t aware of the audit. And at some level, like this is really easy for current systems. But if you had like a system, like when you kind of submit queries to OpenAI’s API, like OpenAI could use these queries to update their model. And so this is one sense in which the like auditing is actually adaptive. Like by submitting some query, you’re actually making it like less likely to be a problem. And so, yeah, I certainly think that like you need the auditing process to be strong enough to overcome the kind of countermeasures. But if you’re sort of a company trying to deploy a model, I think like you can maybe
Michaël: more reasonably control the countermeasures. Imagine you wake up in like 2030 and we’re like alive and in some kind of like utopia or I don’t know, like in a world with like a lot of good stuff from AI, do you think it will be because we have like enough auditing that like all the models are like kind of slow because they’re like trying to like have a lot of safety? Or do you imagine a world where we have like just like AI is doing a lot of useful things because it’s like fully aligned? Like what do you think is like the good features for you?
Prefer safety via alignment over just auditing constraints, Closing thoughts
Erik: I guess like I don’t know to what extent I expect like any of these things to happen by 2030. Like I think the, but I will say the, I think like as these systems get more and more powerful, the kind of risks associated with any deployment go up. And so like I would think of like having these tools as more of a way to avoid unsafe deployments as opposed to just like something you’re kind of doing on demand. That being said, I do think these methods like could be helpful to try to improve models too. Like you could imagine if there’s some behavior you don’t want, like one kind of pipeline is you use some kind of auditing approach to come up with a bunch of instances of the behavior and then you try to like fine tune them away. But yeah, I guess I, it’s a good question. I don’t, my personal guess is the auditing approaches will be like useful to avoid bad deployments, but that like the model like won’t ultimately be good because the auditing approach is kind of like steered it that way. There’ll be some kind of exogenous factor, but it’s speculation.
Michaël: I think we can end on this speculative note. Yeah. I think, thank you very much.