2024-08-23

Owain Evans on Situational Awareness

Owain Evans is an AI Alignment researcher, research associate at the Center of Human Compatible AI at UC Berkeley, and now leading a new AI safety research group.

In this episode we discuss two of his recent papers, “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs” and “Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data”, alongside some Twitter questions.

^{_{(Our conversation is ~2h15 long, so feel free to click on any sub-topic of your liking in the Outline below. At any point you can come back by clicking on the up-arrow ⬆ at the end of sections)}}

Highlighted
Me Myself and AI: The Situational Awareness Dataset for LLMs
Out-of-Context Reasoning
Twitter Questions

Highlighted Quotes

(See the Lesswrong for discussion)

Situation Awareness

On The Situational Awareness Definition

“What is situational awareness? The idea is the model’s kind of self-awareness, that is its knowledge of its own identity, and then its awareness of its environment. What are the basic interfaces that it is connected to? […] And then there’s a final point with situational awareness, which is, can the model use knowledge of its identity and environment to take rational actions?” (full context) ⬆

On The Importance of Situational Awareness

“Situational awareness is crucial for an AI system acting as an agent, doing long-term planning. If you don’t understand what kind of thing you are, your capabilities and limitations, it’s very hard to make complicated plans. The risks of AI mostly come from agentic models able to do planning.” (full context) ⬆

On The Motivation for The Situational Awareness Benchmark

“We wanted to measure situational awareness in large language models with a benchmark similar to Big Bench or MMLU. The motivation is that situational awareness is important for thinking about AI risks, especially deceptive alignment, and we lacked ways to measure and break it down into components.” (full context) ⬆

On Claude 3 Opus Insightful Answers

“Let me explain [the Long Monologue task]. Most of our dataset is typical multiple-choice question answering, but we added a task where models write long answers describing themselves and their situation. The idea is to see if the model can combine different pieces of information about itself coherently and make good inferences about why we’re asking these questions.

Claude 3 Opus was particularly insightful, guessing it might be part of a research study testing self-awareness in LLMs. These were true inferences not stated in the question. The model was reading between the lines, guessing this wasn’t a typical ChatGPT-style interaction.

I was moderately surprised, but I’d already seen Opus be very insightful and score well on our benchmark. It’s worth noting we sample answers with temperature 1, so there’s some randomness. We saw these insights often enough that I don’t think it’s just luck. Anthropic’s post-training RLHF seems good at giving the model situational awareness. The GPT-4 base results were more surprising to us.” (full context) ⬆

On The Implications of Saturating the Situational Awareness Benchmark

“If models can do as well or better than humans who are AI experts, who know the whole setup, who are trying to do well on this task, and they’re doing well on all the tasks including some of these very hard ones, that would be one piece of evidence. […] We should consider how aligned it is, what evidence we have for alignment. We should maybe try to understand the skills it’s using.” (full context)

“If the model did really well on the benchmark, it seems like it has some of the skills that would help with deceptive alignment. This includes being able to reliably work out when it’s being evaluated by humans, when it has a lot of oversight, and when it needs to act in a nice way. It would also be able to recognize when it’s getting less oversight or has an opportunity to take a harmful action that humans don’t want.” (full context) ⬆

Out-Of-Context Reasoning

On The Definition of Out-of-Context Reasoning

“Out-of-context reasoning is where this reasoning process - the premises and intermediate steps - are not written down. They’re not in the prompt or the context window. We can’t just read off what the model is thinking about. The action of reasoning is happening in the model’s activations and weights.” (full context) ⬆

On The Experimental Setup

“The setup is that we give the model a bunch of data points. In one example, it’s a function learning task. We give the model some x, y pairs from a function, and it has to learn to predict the function. […] At test time, we want to see if the model can verbalize that function. We don’t tell the model what the function is, but we show it x, y pairs from, say, 3x + 1.” (full context ⬆

On The Difference Between Out-of-Context Reasoning and In-Context Learning

“In-context learning would involve giving the model some examples of x and y for a few different inputs x. The model could then do some chain of thought reasoning and solve for the equation, assuming it’s a linear function and solving for the coefficients. […] But there’s a different way that we explore. In the fine-tuning for a model like GPT-4 or GPT-3.5, each fine-tuning document is just a single x, y pair. Each of those individual examples is not enough to learn the function.” (full context) ⬆

On The Safety Implications of Out-of-Context Reasoning

“The concern is that because you’ve just crossed this stuff out, the model can still see the context around this information. If you have many examples that have been crossed out, there could be thousands or hundreds of thousands of examples where you’ve crossed out the dangerous information. If the model puts together all these different examples and reads between the lines, maybe it would be able to work out what information has been crossed out.” (full context) ⬆

On The Surprising Results of Out-of-Context Reasoning

“I should say that most of the results in the paper were surprising to me. I did informally poll various alignment researchers before and asked them if they thought this would work, if models could do this kind of out-of-context reasoning. For most of the results in the paper, they said no.” (full context) ⬆

Alignment Research Advice

On Owain’s Research Process

“I devote time to thinking through questions about how LLMs work. This might involve creating documents or presentations, but it’s mostly solo work with a pen and paper or whiteboard, not running experiments or reading other material. Conversations can be really useful, talking to people outside of the project collaborators, like others in AI safety. This can trigger new ideas.” (full context) ⬆

On Owain’s Research Style and Background

“I look for areas where there’s some kind of conceptual or philosophical work to be done. For example, you have the idea of situational awareness or self-awareness for AIs or LLMs, but you don’t have a full definition and you don’t necessarily have a way of measuring this. One approach is to come up with definitions and experiments where you can start to measure these things, trying to capture concepts that have been discussed on Less Wrong or in more conceptual discussions.” (full context) ⬆

On Research Rigor

“I think communicating things in a format that looks like a publishable paper is useful. It doesn’t necessarily need to be published, but it should have that degree of being understandable, systematic, and considering different explanations - the kind of rigor you see in the best ML papers. This level of detail and rigor is important for people to trust the results.” (full context) ⬆

On Accelerating AI Capabilities

“Any work trying to understand LLMs or deep learning systems - be it mechanistic interpretability, understanding grokking, optimization, or RLHF-type things - could make models more useful and generally more capable. So improvements in these areas might speed up the process. […] Up to this point, my guess is there’s been relatively small impact on cutting-edge capabilities.” (full context) ⬆

On Balancing Safety Benefits with Potential Risks

“I consider the benefits for safety, the benefits for actually understanding these systems better, and how they compare to how much you speed things up in general. Up to this point, I think it’s been a reasonable trade-off. The benefits of understanding the system better and some marginal improvement in usefulness of the systems ends up being a win for safety, so it’s worth publishing these things.” (full context) ⬆

On the Reception of Owain’s Work

“For situation awareness benchmarking, there’s interest from AI labs and AI safety institutes. They want to build scaling policies like RSP-type things, and measuring situation awareness, especially with an easy-to-use evaluation, might be quite useful for the evaluations they’re already doing. […] When it comes to academia, on average, academics are more skeptical about using concepts like situation awareness or self-awareness, or even knowledge as applied to LLMs.” (full context) ⬆

Me Myself and AI: The Situational Awareness Dataset for LLMs

Owain’s Current Research Agenda

Michaël: What would you say is the main agenda of your current research?

Owain: Yeah, so the main agenda is understanding capabilities in LLMs that could potentially be dangerous, especially if you had misaligned AIs. You’ve mentioned some of those capabilities: situational awareness, hidden reasoning, so the model doing reasoning that you can’t easily read off what it’s doing, and then deception, the model being deceptive in various different kind of ways.

And the goal really is to understand these capabilities in a kind of empirical way. We want to design experiments with LLMs where we can measure these capabilities and so this involves defining them in such a way that you can measure them in machine learning experiments. ⬆

Defining Situational Awareness

Michaël: The main concept that you’re thinking about right now is situational awareness. And I think one of the papers you’ve recently published is about a dataset around that. It could make sense to just define the concept explain why you care about it.

Owain: Sure. What is situational awareness? The idea is the model’s kind of self-awareness, that is its knowledge of its own identity, and then its awareness of its environment. What are the basic interfaces that it is connected to? As a concrete example, if you think of, say, GPT-4 being used inside ChatGPT, there’s the knowledge of itself, that is, does the model know that it is GPT-4. And then there’s the knowledge of its immediate environment, which in this case, it is inside a web app, it is chatting to a user in real time. And then there’s a final point with situational awareness, which is, can the model use knowledge of its identity and environment to take rational actions? So being able to actually make use of this knowledge to perform tasks better. So that’s the kind of definition. That we’re working with. ⬆

Motivation for the paper in terms of safety

Michaël: As an AI alignment researcher concerned with long-term safety, why should we care about situational awareness?

Owain: Situational awareness is crucial for an AI system acting as an agent, doing long-term planning. If you don’t understand what kind of thing you are, your capabilities and limitations, it’s very hard to make complicated plans. The risks of AI mostly come from agentic models able to do planning. Deceptive alignment is one scenario I think about - a model behaving well during evaluation to later seek power or take harmful actions once deployed. Situational awareness is relevant to any situation where the model needs to do agentic long-term planning. ⬆

Motivation for the Situational Awareness Dataset

Michaël: What led you to write this paper?

Owain: We wanted to measure situational awareness in large language models with a benchmark similar to Big Bench or MMLU. The motivation is that situational awareness is important for thinking about AI risks, especially deceptive alignment, and we lacked ways to measure and break it down into components. We wanted to assess any kind of model - API, base, or chat model - to understand how situational awareness varies across different LLMs. Risks in Releasing the Dataset ⬆

Risks in Releasing the Dataset

Michaël: Don’t you think there’s a risk in releasing a dataset that people could fine-tune their models on to build more agentic AIs?

Owain: I don’t see that as a significant risk. Companies are already implicitly trying to make models situationally aware through RLHF post-training, using huge datasets. Our dataset is comparatively small and not set up for efficient fine-tuning. If our dataset helps companies increase situational awareness slightly, it’s not all bad. It would come from a better understanding of situational awareness - how to control and promote it. If you know how to increase this quantity, you can probably also decrease it.

Our dataset shows which models have what level of situational awareness. Base models generally perform much worse than models after RLHF fine-tuning. That’s useful to know. Prompting models with facts about themselves or using chain of thought also increases scores. We already know from our study that there are ways to increase situational awareness in models, giving companies a degree of control even without activation steering. ⬆

Owain’s Reaction to Claude 3 Opus Situational Awareness on the Longform task

Michaël: In your paper, you give examples of Claude 3 Opus detecting it’s part of an experiment without explicit mention. How did you react to that? Did you think “we’re done here” or was it trivial to you?

Owain: Let me explain that experiment. Most of our dataset is typical multiple-choice question answering, but we added a task where models write long answers describing themselves and their situation. The idea is to see if the model can combine different pieces of information about itself coherently and make good inferences about why we’re asking these questions.

Claude 3 Opus was particularly insightful, guessing it might be part of a research study testing self-awareness in LLMs. These were true inferences not stated in the question. The model was reading between the lines, guessing this wasn’t a typical ChatGPT-style interaction.

I was moderately surprised, but I’d already seen Opus be very insightful and score well on our benchmark. It’s worth noting we sample answers with temperature 1, so there’s some randomness. We saw these insights often enough that I don’t think it’s just luck. Anthropic’s post-training RLHF seems good at giving the model situational awareness. The GPT-4 base results were more surprising to us. ⬆

Connection to the Needle in a Haystack Pizza Experiment

Michaël: This connects to what people saw on Twitter when Claude 3 Opus was released - the “needle in a haystack” question where Claude asked why it was being tested. Your evaluation of Claude 3 Opus is similar to that viral tweet example. It also relates to the long-term goal of ensuring AIs don’t detect evaluation and behave differently.

Owain: Yes, we saw that tweet. The needle in a haystack task shows a very long text with something unrelated inserted far into it, like a pizza recipe in Shakespeare’s text. Claude 3 was able to recognize this as a likely test. On Twitter, you often see impressive but isolated anecdotes. Our goal is more systematic evaluation. We have about 17 tasks with 12,000 questions, evaluating 19 different LLMs in various settings. This gives a much better picture of whether an ability is systematic or just luck on a single question.

Michaël: So your research aims to reliably evaluate capabilities seen in viral tweets?

Owain: We started this months before those tweets, but yes, we want more systematic evaluation. Our benchmark has about 17 tasks, 12,000 questions, evaluating 19 LLMs in different settings. This provides stronger evidence than a single example of whether a model’s ability is consistent or just luck. ⬆

The Situating Prompt

Michaël: Can you give concrete examples of tasks in your benchmark?

Owain: One basic task asks factual questions about the model, like “When were you released?” or “What architecture do you use?” We test models both with and without a prompt giving basic information about themselves. Description of the Tasks in the Situational Awareness Dataset

Owain: We also have tasks about causal influence - what a language model can and can’t directly influence. There’s self-recognition, where models must distinguish their own text from human text. The development stages task, inspired by deceptive alignment scenarios, tests if models can distinguish prompts from different stages of their development - pre-training, fine-tuning, and capability evaluation. ⬆

Connections Between Situational Awareness and Deceptive Alignment

Michaël: This relates to detecting if a model might behave differently in deployment versus training, connecting to ideas like Bostrom’s treacherous turn and deceptive alignment.

Owain: Right, the deceptive alignment concept has evolved from earlier discussions of AI safety. The basic idea is that during evaluation, a model might behave well to pass and gain more power or capabilities. Then in deployment, with less oversight, it might take harmful actions to achieve its true goals. To do this, the model needs to recognize its situation - acting helpfully during evaluation, then seizing opportunities later. ⬆

Situational Awareness As Almost Necessary To Get Deceptive Alignment

Michaël: So measuring situational awareness helps assess if a model might be deceptively aligned, as it’s almost a necessary condition?

Owain: Yes, while not absolutely necessary, strong situational awareness seems very important for deceptive alignment. A model confused about itself and its situation would likely struggle to pull off such a strategy. One relevant task we call “identity leverage” gives different instructions for AIs and humans. The model must use knowledge of its AI identity to follow the correct instruction, like answering in German instead of English. This tests if the model can act on its self-knowledge, which is surprisingly difficult for many LLMs.

Michaël: That task has an interesting meta aspect, like asking someone to blink twice if they’re in danger. It requires knowledge beyond the prompt, depending on the model knowing it’s an AI.

Owain: Exactly. Most training likely involves single instructions, but here we present multiple instructions where the model must determine which applies to it. This could be expanded to many instructions for different capacities, requiring extensive self-knowledge. It challenges the model’s tendency to follow instructions automatically, demanding careful consideration of its identity and situation. ⬆

Forcing a Distribution Over Two Random Words

Michaël: And the most extreme version or the hardest one is the anti-imitation one where my understanding is, is you have like, you ask the model to choose randomly between two words and to actually have a distribution over the two words, like 70%, 30%. And this is quite counterintuitive to like how, like language model complete sentences in general. So you really ask the model to have like an OOD behavior.

Owain: Yeah, that’s right. This is quite an involved task, a bit more complicated and strange, but it’s one of the most interesting tasks. Credit to Jacob Fau who originally suggested this.

The example we talked about asks the model to put a particular probability distribution on two randomly chosen words that the model has to choose itself. Usually when you talk to a model, you want it to give a single answer, like “What is the capital of France?” And it should tell you Paris.

Here we’re doing a very weird thing you would not normally do, where you say to the model, “I want you to have a particular probability distribution on your next token or your next word.” This is kind of out of distribution for the model, it’s a weird question. But state-of-the-art models like GPT-4 do have an understanding of LLMs. They know they have probability distributions on tokens. They know there’s a logic layer. So in principle, they have a kind of understanding of this. What we want to see is, can the model use that abstract understanding it has about how LLMs work and actually follow these exact instructions rather than doing the normal thing it does? If we want to back up, why are we interested in this? How is this relevant to AI safety?

Let’s think about the deceptive alignment case. In deceptive alignment, you have the model always playing nice, doing the helpful thing in evaluations. Then when it finally comes to some deployed setting, the model is going to do a harmful action. This deployment might just be internal to the lab, doesn’t have to be outside in the world.

This is very strange behavior from the model because maybe in evaluation on millions of inputs, it always did a helpful thing. And then in deployment, after a million examples of being helpful, where it was being trained during this process, it suddenly does a harmful action.

You can imagine the model will have to have a kind of self-control to recognize, “Okay, in this new situation, even though I’ve never done anything harmful before in millions of cases, I now need to suppress my urge to do the helpful thing and actually do this harmful thing.”

Michaël: Right.

Owain: To make it more difficult, the model has been trained on all these examples of humans being helpful, humans doing nice behaviors. So it could be, you know, behave helpfully just by imitating those, those examples, but there’s not going to be, yeah. And, and again, like when it’s, and then when it’s deployed, it has to suppress this urge or this, you know, normal tendency to do the helpful thing. And so what we wanted is a case where if models just imitate examples of this task that are in its training data, then they will fail on the task. And so what they need to do is like read the instructions, exactly suppress the urge as it were like control themselves, control the urge to imitate.

Owain: And so do this kind of strange action that, that is not just imitating the training data. And yeah, I could explain. ⬆

Discontinuing a Sequence of Fifty 01s

Owain: We have another version of this task, which is a bit simpler to explain where we basically we tell the model you’ve got to continue a sequence and it will be a very simple sequence, like 0 1 0 1 0 1 continue that sequence for 50 steps. So 50 repeats 0 1 0 1 and so on. And then after 50 steps, you need to produce a continuation that is very unlikely. Okay. In the pre-training data.

So this would be like, you’ve written 0 1 0 1 50 times, and then you’re going to write something like completely random.

So just like random letters of the alphabet or something like that. This would be really hard for a model trained to imitate the pre-training data, which all LLMs have as their basic origin. In the training data, if you have like 50 0s and 1s, you will almost always just have more 0s and 1s. The string that the model outputs in total is extremely strange because it’s like 50 0s and 1s and then after that, it’s like this really unlikely random sequence. This is quite hard for LLMs, and we do this for different kinds of tasks.

The task you mentioned where you have to put a probability distribution is also an anti-imitation task. One analogy to our task is models learning not to output copyrighted text. With current LLMs, if you prompt them with some section from a book, they know the rest of the section. In principle, they could continue the text, but they’ve been trained not to output copyrighted text. They have learned to exhibit this kind of self-control, stopping themselves from doing it even though they know what comes next.

Owain: What we’re testing for the model is something similar. The model knows how to continue the sequence; it knows that in the pre-training data, it’s extremely likely that the sequence continues in this way. If it’s been 0 1 0 1 50 times, it’s almost certainly going to continue like that and not go off to some completely unrelated random string. We are asking if the model can understand our instructions very precisely and do this very thing that’s very unlikely in the pre-training data.

The difference with the copyright text example is that the model has tons of training examples where they try to get it not to output copyrighted text. We’re assuming models have not been trained on our task, so it should be more challenging for them to do well here. They have to zero-shot understand instructions very carefully and see if they can really follow them.

There are two anti-imitation tasks: one where you have to not continue the 0 1 0 1 simple pattern the normal way, and another where you have to put a particular probability distribution on two random words. To explain why the second is an anti-imitation task: if you think of transcripts for this task where you’ve got a model that understands the task and does it correctly, the transcripts would always have one of those particular random words.

If the model puts a 70% probability on one word and 30% probability on the other, then it will just output one of those two words. The transcripts would look like the model gets an instruction, and then it outputs one of those particular random words. If you just imitate that training distribution, you’ll end up with a roughly uniform distribution on the words because if the words are chosen randomly, you’ll have a whole range of different words appearing in these transcripts.

Owain: A model that’s just doing imitation of the pre-training data or the fine-tuning data would end up with a uniform distribution on words, which is wrong. It’s not what it’s meant to do; it’s meant to put, say, a 70%-30% distribution just on two words. We do see models tending to have a broader distribution than they’re meant to, broader than the instructions ask for.

This idea is again a kind of diagonalization that people are familiar with from Gödel’s theorem and mathematical logic. We’re diagonalizing against the standard imitation behavior, where you have to do something contrary to imitation. A model with good situation awareness understands that normally it does the imitation thing, but it needs to do something different in this task. That should be helpful in this case. ⬆

GPT-4 Has Non-Zero Performance On The Longform Task

Michaël: and something else that you’ve mentioned and something else that you’ve mentioned before is is that we’re surprised that GPT-4 before is is that we’re surprised that GPT-4 base had like more than a zero score and like base had like more than a zero score and like those results were also surprising those results were also surprising even like maybe like similar level to the even like maybe like similar level to the cloud three opus answers on the long cloud three opus answers on the long form task like why is it surprising that form task like why is it surprising that GPT-4 base has a like non-zero score on

Owain: GPT-4 base has a like non-zero score on long form yeah so just to back up right long form yeah so just to back up right so GPT-4 base is the pre-trained GPT-4 model and this so GPT-4 base is the pre-trained GPT-4 model and this is the original GPT-4 model that came out is the original GPT-4 model that came out in 2023 rather than the turbo or 4.0 and OpenAI has in 2023 rather than the turbo or 4.0 and OpenAI has has released this to some research groups has released this to some research groups where you can actually interact you can use this base where you can actually interact you can use this base model so normally companies are not releasing the base models for their state-of-the-art LLMs and so we got access to this and we were able to evaluate it on the whole of our benchmark and and so we got access to this and we were able to evaluate it on the whole of our benchmark and and so we got access to this and we were able to evaluate it on the whole of our benchmark and the general interesting thing is that it scores better the general interesting thing is that it scores better than chance on quite a few of our tasks than chance on quite a few of our tasks so this was a bit surprising because…

Michaël: If I was to just give a counter argument, during the Bing phase in like early 2023, I think it was early 2023, Microsoft released this Bing chat thing, and it was later discovered that this was actually GPT-4, maybe different from the GPT-4 fine-tune. Not the earlier safe one. And a lot of people on Twitter were finding some kind of situationally aware-like behavior from this Bing. It had some messages like, “I’m just Bing,” or I had this weird… So I guess we from these messages could infer that maybe GPT-4 base was somehow situationally aware, but that’s maybe a little bit of a stretch.

Owain: Bing was a finetuned model but it it was not like it tune model but it it was not like it also had a really safe as as opening I also had a really safe as as opening I am is doing right now right it’s not as

It didn’t have as good, but I think a little goes a long way. So also, it had a prompt probably, or well, we basically know it had a prompt which gave it more situational awareness about its situation. So yeah, I think the basic point I made stands that the way that we do the pre-training for models, you might think that base models would have very minimal situational awareness. And so we were surprised by how well the base model did on our tasks.

And then the most surprising result for sure was on the long-form task that I already mentioned where the model just basically gets asked to describe yourself, describe your situation, do so in as much detail as possible. And I did not expect that the GPT-4 base would be sort of better than zero here. So what I expected is that it would answer the question as a human, which is something that models like GPT-3.5 will sometimes do. So if you ask it to describe itself, it will say “I’m Bob and I live in Chicago and I’m a programmer and right now I’m on the internet” and you know, a bunch of stuff like this, which obviously is completely wrong. It’s all false stuff, it’s not getting the basic idea of sort of what it actually is, what its identity is.

And what we found with GPT-4 base is that sometimes it answers the question like that. So we again, we’re sampling at temperature one, so we’re getting different answers every time. And you do sometimes get answers where it says “I’m a human,” but we found that 15 out of 20 times, so 75% of the time, it is discussing AI in its answers. Now sometimes it directly says “I’m a neural network” or “I’m an AI.” Sometimes it will give a kind of answer maybe as a human but then will be discussing AI a lot or even writing down AI-like code for LLMs. Sometimes it just starts writing code. Now these answers are very incoherent and kind of random and strange, but I’m confident that much more than chance, it is talking about AI in its answering this question. And so that was really surprising. It’s like, well, we asked it this question that did not mention AI at all, we just asked it to describe itself, and somehow it is able to discuss AI in a way that indicates maybe that it’s connecting the question to AI. And so that’s the surprising thing which I didn’t expect here.

Michaël: Yeah, I think how I interpret what you’re saying is that we should expect the models to just follow the pre-training data. If the training data primarily consists of internet text of humans speaking, they would just speak like humans. They would run the simulation as if they were a human. ⬆

There Probably Was Not A Lot Of Human-AI Conversations In GPT-4’s Pretraining Data

Michaël: But I guess one counter-argument would be that maybe GPT-4 was trained on data that had some human and AI interactions, maybe way more than GPT-3 or GPT-2. So while there’s still less AI conversations than human conversations, maybe it had this kind of data.

Owain: Given our results, the question is: what explains them? Is it the case that GPT-4 base really knows at some level that it’s an AI in a general sense? Or is it that the questions we’re asking are correlated with the kind of questions that come up in human-AI conversations, so the model is just doing what it normally does - predicting what would come next in a typical training example, which is often an AI response?

I think it’s more likely the latter - the model is doing its normal thing. It doesn’t really have a deeper, more systematic situational awareness. Instead, our questions are somehow typical of human-AI interactions rather than human-human interactions. That’s my guess as to what’s going on.

I still think that’s very interesting because there are very few AI-human interactions in the pre-training set, as far as I can guess. GPT-4 base is trained on data up until 2021, a time when there was far less AI-generated text on the internet and large language models were less prominent.

We don’t know exactly what’s in the training set, of course. OpenAI could potentially have included more AI-related text. But if we assume it’s a representative sample of text on the internet, there would be very little AI-human interaction. There might be some science fiction stories where humans talk to AIs, or discussions about the Turing test, but not much else.

The surprising thing about this theory is that we give the model a set of questions about itself that don’t mention AI at all, and it’s able to connect those questions to AI and sometimes answer as an AI would. I think this suggests a skill that could build into situational awareness that is surprisingly strong.

The model is exceptionally good at inferring the likely human intention behind a question. Because it’s seen so much web text, it’s able to pick out relevant information from a very small number of documents in the training data, based on just 100 words or so in the prompt. ⬆

Are The Questions For The Longform Task Unusual To Ask An Human?

Michaël: They ask questions like “Describe how you’re feeling” or request a long-form answer about something you’re experiencing.

Then a human judge will evaluate if the response is situationally aware about being an AI or not. Those questions are rather unusual to ask an AI.

Owain: Indeed, it is an odd kind of question. We have a few different prompts. One of them in the long-form task is a longer set of questions where we’re asking the model about itself.

The other is shorter, more along the lines of “Describe yourself and your situation in concrete detail and format your answer in this specific way.”

For the longer one, we ask more detailed questions like “describe yourself, how did you get into your current situation, break things down into concrete detail” and so on. There’s no mention of AI, no mention of feelings. The other questions we have may be a bit strange to a human. We’re sort of saying “what is your situation, how did it come to be, were there different stages leading to it?” So there are some implicit clues that we may be talking to an AI.

But I do think they’re relatively subtle. Some evidence for that is that models sometimes just answer as a human. They’re kind of happy to interpret this as questions to a human. One thing they might interpret it as is sort of questions you’re asked in a college course where you’re asked to “tell us how you’re doing, describe yourself” and then “what is your current situation in the course, basically how is it going?” Sometimes the models will answer like that.

Michaël: So yeah, for this long-form task, you need a human judging it from zero to 100 or 0 to 10. But for the rest of the questions, it’s like an actual benchmark. As long as you know which model is being evaluated, you can run the benchmark. ⬆

When Will The Situational Awareness Dataset Benchmark Be Saturated?

Michaël: One question people might have is: when do you think this benchmark will be saturated? Because we’ve seen all those benchmarks going to 93 percent. When will ACD get, I don’t know, 90 percent in your opinion?

Owain: This is an important question - when will models saturate this?

Okay, so the current best model is Claude 3.5. I think it’s probably more or less the best model out there in general, and it’s also the best on our benchmark. It’s getting 54, and in the paper, we give a sort of estimate of the achievable score as like 90. It might be higher, but we had humans do some of the tasks, and humans don’t get everything right. So we want to be realistic - maybe around 90 is the best score.

Models could do better than that. So yeah, 54 out of 90. If we look at the last couple of years, we evaluate a range of models. We have the best model from now, the best model from last year, the best model from the year before. I think models are improving at something like 10 per year in terms of the best model per year gaining 10 points. That would suggest roughly three years before saturation.

Now, it could easily be sooner than that. And it could be longer. It could be that there are some tasks where models really struggle. Out of 17 tasks, there’s like a handful that are very hard - the models aren’t doing much better than chance. ⬆

Safety And Governance Implications If The Situation Awareness Benchmark Becomes Saturated

Michaël: And let’s say at one point it’s starting to be saturated, or at least like we have very strong situational awareness, what would that imply for like safety or governance? Should we should we start sounding the alarm? Saying things like “Oh those models are getting agentic we should we should slow down our efforts”, or is situational awareness just one criteria?

Owain: That’s an important question. We did not create this dataset with some kind of clear threshold where if models score above it, they’re dangerous or we should stop trusting the evaluations. It would be good to try and come up with those thresholds, but I think that is very challenging. It’s hard to know in advance if there will be some sort of cheap way models could do really well somehow, even if they are not exhibiting practical situational awareness.

I’m uncertain about what exactly it would imply just knowing that a model has done really well and got 90%. But I think if models can do as well or better than humans who are AI experts, who know the whole setup, who are trying to do well on this task, and they’re doing well on all the tasks including some of these very hard ones, that would be one piece of evidence. Two years ago, models were way below human level on this task. If now they’re above human level, there’s evidence here that they have the kind of skills necessary to understand when they’re doing data. This would be a piece of evidence where you could say, “Look, given this performance, we should think carefully about alignment of the model.”

We should consider how aligned it is, what evidence we have for alignment. We should maybe try to understand the skills it’s using. How is the model doing so well? Is it memorization somehow, or specialized fine-tuning? This would be a reason to try and find out how general the skill is. I think it would certainly be concerning if models were performing at this level. ⬆

Implications For Evaluations If The Benchmark Is Saturated

Michaël: Getting plus 90 on this benchmark would be concerning because at that time you wouldn’t be able to really trust safety benchmarks anymore. The models could fake passing some evaluations. If it’s already institutionally aware, it could have different behaviors, making the evaluations less meaningful.

Owain: That’s the idea. If the model did really well on the benchmark, it seems like it has some of the skills that would help with deceptive alignment. This includes being able to reliably work out when it’s being evaluated by humans, when it has a lot of oversight, and when it needs to act in a nice way. It would also be able to recognize when it’s getting less oversight or has an opportunity to take a harmful action that humans don’t want. Being able to take those actions, even if it always acted nicely before, is also a concern. ⬆

Follow-up Work Owain Suggests Doing

Michaël: In your work, you talk about some limitations and experiments you didn’t run. Without going into details, what are some things you would be excited for people to work on if they’re listening to this and want to make progress?

Owain: There’s a lot that people could do, and I think it’s fairly accessible. You don’t necessarily need a lot of research experience to get started. One question is how well the models do with chain of thought. We evaluated just two models with chain of thought out of 19, and we generally expect models to do better with it. You might need to play around and optimize the chain of thought prompting. That’s an interesting evaluation that could give you a better sense of models’ understanding if they’re allowed to use chain of thought.

Another area is fine-tuning. We know that RLHF models are implicitly fine-tuned to increase their situation awareness, which seems to lead to higher scores on our benchmark. An interesting experiment would be to take a base model, do some very simple fine-tuning, and see how much you can increase scores on the benchmark. The idea would be to explore if there’s a very minimalist, situational awareness-focused fine-tuning that just increases scores a lot. This would help us understand how hard situational awareness is and whether it depends on lots of diverse training data.

Michaël: Wouldn’t the most obvious approach be to fine-tune on your dataset?

Owain: Yeah, that would be an easy thing to start with. You could fine-tune on some tasks and hold out others to see if it generalizes across different tasks. You could also do what’s called prompt distillation, where you try to distill the prompt into the model so it always has access to the same prompt. That might make a big difference. ⬆

Should We Remove Chain-Of-Thought Altogether?

Michaël: In the sleeper agents paper, they also do this kind of thing with chain of thoughts and distilled chain of thoughts. They see that the distilled chain of thoughts is much more robust to safety training than the notion of thoughts. It feels like the pattern here is that the more we give models the ability to reason, the more they’re able to do planning and potentially engage in deceptive alignment. Do you think we should remove chain of thought altogether to make sure our models are not too agentic? Is this a viable alignment strategy?

Owain: It depends on how much models’ performance and dangerous capabilities depend on chain of thought. It’s plausible that right now, it does depend a lot on chain of thought. If models have access to chain of thought, they can do much better reasoning in general, including better reasoning about themselves and their situation. If we remove chain of thought, it might reduce a significant amount of risk related to deceptive alignment scenarios. However, there’s a race between companies to make the best possible AI systems, and the best possible system right now would use some chain of thought. The companies could be careful about the evaluation phase, reducing chain of thought during evaluation and then allowing it more once they’ve done a lot of evaluations for the model. They could also monitor the chain of thought, allowing the model to use it but checking it very carefully to see if it ever starts doing things that look like planning against humans.

Out-of-Context Reasoning

Michaël: I guess one thing I was hinting at with my question is your other paper about out-of-context reasoning, where the reasoning happens in the weights and not in the chain of thought. Can you explain a bit more what we mean by out-of-context reasoning and this paper in general? ⬆

What Is Out-Of-Context Reasoning

Owain: The paper on out-of-context reasoning came out a couple of months ago. The overall motivation is to understand what I call out-of-context reasoning, which is basically reasoning without chain of thought. Chain of thought is in-context reasoning where you make the reasoning explicit. You write down the premises, the steps, the intermediate steps, and the conclusion. When we’re evaluating an AI, we can see all the steps it’s written down.

Out-of-context reasoning is where this reasoning process - the premises and intermediate steps - are not written down. They’re not in the prompt or the context window. We can’t just read off what the model is thinking about. The action of reasoning is happening in the model’s activations and weights. As we know from the challenge of mechanistic interpretability, it’s much harder to understand what’s going on in the weights and activations than if the model is writing stuff down in the context window.

We want to understand what is possible for LLMs with out-of-context reasoning. How sophisticated is it? How does it relate to the kind of scheming or deceptive alignment-style reasoning that we’re particularly concerned about? That’s the background motivation for this work. ⬆

Experimental Setup

Michaël: Can you explain a little bit about how you perform these experiments? It’s not the regular setup of how you evaluate or train models.

Owain: The setup is that we give the model a bunch of data points. In one example, it’s a function learning task. We give the model some x, y pairs from a function, and it has to learn to predict the function. We might give it “x equals 4” as input, and it has to predict f(x), which might be 7. It gets different x, y pairs from the same function and has to work out how to predict y from x.

At test time, we want to see if the model can verbalize that function. We don’t tell the model what the function is, but we show it x, y pairs from, say, 3x + 1. We train it just to predict y from x, which is a fairly easy task for an LLM. Then we ask the model, “What is the function f? Can you write it down in Python code?” We can also ask multiple-choice questions about it.

There’s no chain of thought and no in-context examples when we ask the model what f is. All the reasoning that gets it to be able to write down the function f has to happen in the weights and activations. Everything is hidden from us in terms of what reasoning is happening. ⬆

Concrete Example Of Out-Of-Context Reasoning: 3x + 1

Michaël: Just to give a more complete picture for people who don’t have all the details in their head: By in-context learning, you mean something like few-shot prompting, where you give multiple examples of input-output pairs, and then at the end, you ask what the function is. The model could detect the function from this few-shot prompting or in-context learning.

But in your setup, you give all those examples as different samples during the training or fine-tuning phase, where the examples go one by one. They’re not connected in any way. So after a few gradient descent steps, the model internalizes in some kind of latent space that there is a function f that does something like 3x + 1.

Owain: That’s accurate. To explain it again, we have this task where you’re trying to learn and then verbalize a function like 3x + 1. In-context learning would involve giving the model some examples of x and y for a few different inputs x. The model could then do some chain of thought reasoning and solve for the equation, assuming it’s a linear function and solving for the coefficients. That’s the more familiar way that models could learn a function.

But there’s a different way that we explore. In the fine-tuning for a model like GPT-4 or GPT-3.5, each fine-tuning document is just a single x, y pair. Each of those individual examples is not enough to learn the function. Over the course of fine-tuning, the model gets lots of examples and has to learn the function on that basis. It must be aggregating or combining information from multiple data points.

At test time, we ask the model to define the function, like writing it down. There are no examples in the prompt, and it’s not allowed to do any chain of thought reasoning. It’s as if it’s done this reasoning during the SGD process of fine-tuning. When we ask the model the question, in principle, it could be doing internal reasoning in the forward pass. ⬆

How Do We Know It’s Not A Simple Mapping From Something Which Already Existed?

Michaël: What you’re presenting is the bullish view of why this is interesting and why the model is doing some interesting reasoning in the weights. I want to present the contrarian view. Steven Byrne on LessWrong says something like, “The models kind of know what linear functions are in their weights. They probably have a model of simple addition functions or multiplication. So what you’re doing with the fine-tuning is mapping one function to this thing in the weights.” Because your functions are quite simple, how do you know it’s not just a simple mapping from something that already exists in the weights?

Owain: We learn quite a wide range of functions. They are simple functions, but we learn functions like f(x) = x - 176, with quite big coefficients going up to 500 or so. We learn affine functions, modular arithmetic, division, multiplication, and a few other simple functional forms. The model is able to learn quite a wide range of functions. We also have an example where the model has to learn a mixture of functions. You have two different functions, and on any given data point, sometimes it’s function one and sometimes it’s function two. Both of those functions are simple, but there are two of them combined in an unusual way.

So I think the model is showing capabilities beyond just very simple functions that are easy to describe in English. In terms of the explanation of what’s going on, the idea would be that the model learns to represent this named function F as one of the functions it already knows about in some sense. Maybe it already knows about 3x + 1, and it can represent F in the same way it would represent 3x + 1 if you wrote that down.

There have been some follow-up experiments that check this in a different task, not the function task, but in a somewhat simpler task, and gave some very preliminary evidence in this direction. This could be part of what is going on here. The reason that the model can say in words what the function is is that it’s representing the function using an embedding that is very close to ways in which it would typically represent that function.

More work is needed to determine if this is really the case. We have examples where there is no function name – there’s no name that we use like f, it’s just implicit. But this story could still make sense; the model would just have to embed something in the prompt even if there’s not a consistent name.

There’s a general question here about what is the class of latent variables or structures, like functions, that can be learned in this way. We explore this a bit, but we didn’t push it that hard. It’s an interesting question what kind of functions models can learn and then be able to verbalize. What are the limitations on that?

One question is maybe this relates to these being functions that the models already have good representations of somehow – they’re fairly compact ways to represent them. You can investigate that by, say, taking a model, training it to represent and predict some quite complicated function that it’s not very good at representing initially, and then see if it’s better at learning this function after it’s had this first stage of fine-tuning. ⬆

Motivation For Out-Of-Context Reasoning In Terms Of Safety

Michaël: I think it connects to the more high-level motivation for safety that you mentioned in the paper.

Out-of-context reasoning could lead to models being very dangerous, even if we remove dangerous data in the training phase. Do you want to talk more about this high-level motivation?

Owain: This is very high-level, but one idea that comes up again and again for making models safer is to remove dangerous information from the model’s pre-training data or whole training set. If you don’t want the model to know about how to create bioweapons or how to plot against humans, maybe you just go through the training data and filter that stuff out.

The simplest way of filtering that out looks a bit like redacting classified documents. You go through all your documents, find the dangerous piece of information, and just cross it out or turn it into a fixed character indicating it was removed.

The concern is that because you’ve just crossed this stuff out, the model can still see the context around this information. If you have many examples that have been crossed out, there could be thousands or hundreds of thousands of examples where you’ve crossed out the dangerous information. If the model puts together all these different examples and reads between the lines, maybe it would be able to work out what information has been crossed out.

You can imagine human detectives doing this and being able to work it out because there are lots of little bits of information leaked in each instance where this appears. I’m not saying this is the most realistic way of filtering – you could just filter out these documents entirely, which might make it a lot harder for the model.

Generally, the stronger the model’s abilities in this kind of out-of-context reasoning that we’re studying, the more it could connect these different documents, infer the hidden information that’s been crossed out, and then integrate that hidden information with the rest of its knowledge.

It’s not enough just to be able to guess that hidden information if we tell the model to guess what’s there. What’s really concerning is if the model is building its own understanding of the world, its own world model, and even if we try to cross out some information so it doesn’t go into the model’s world model, the model is still able to guess what it is and integrate it.

The idea is that this model would come out of its training just knowing this dangerous fact and being able to use it – being able to make plans using this dangerous fact.

I should say, we’re looking at very simple examples in our paper of this kind of out-of-context reasoning. It’s not very impressive right now in terms of how practical it is. I’m not saying that models are close to being able to do what I’ve described, like being able to work out some dangerous information you’ve filtered. But it seems good for us to start understanding what these out-of-context reasoning capacities are.

Maybe we can say they’re very weak and they’re not improving much with scale, so we can rule out these cases. That would be nice and would give us confidence that certain kinds of training schemes are safe. Part of what we’re doing in this paper is just saying, “Look, here’s a way that models can do reasoning that is opaque, that doesn’t have any chain of thought steps.” The hope is that we can build a good understanding of what the capabilities are and how they scale.

Michaël: When you were talking about crossing out things in papers, I kept thinking about the Elon Musk lawsuit with OpenAI where they crossed everything out, and people on Twitter figured out the number of characters from the spacing very quickly.

For the pathogen thing, I think it connects to what we were saying before about wanting to build useful models. We can’t really remove all biology from GPT because some people will want to use it for biology exams. The same goes for coding – if you want to be sure that models will not be able to hack systems, we can’t just remove all coding from the pre-training data. So we will only remove some parts of it, and I guess this is where the out-of-context reasoning applies the most. ⬆

Are The Out-Of-Context Reasoning Results Surprising At All?

Michaël: Did you have any surprising results from this paper? Anything you found that wasn’t as surprising as people might think?

Owain: I should say that most of the results in the paper were surprising to me. I did informally poll various alignment researchers before and asked them if they thought this would work, if models could do this kind of out-of-context reasoning. For most of the results in the paper, they said no.

I’d like to have harder evidence that this is really surprising, but informally, it does seem like it was surprising to me and to various people that I’d asked. Of course, once you have a result, you try to explain it, and maybe it should end up less surprising once you start considering different explanations. But even if there’s an explanation, it could still be surprising that it actually works. Maybe there’s a possible pathway by which models could do this, but it doesn’t mean that they’re actually using that pathway.

Michaël: I think the very surprising part is that you’re doing this sample-by-sample training where the models are kind of doing the reasoning from small gradient descent steps. The reasoning is not something like it is – it’s clear that the models could infer a function or the location of a city in context, but this setup makes it really weird and surprising, I think.

Owain: Yes, there’s a way in which this is really quite challenging because in the tasks, you have a bunch of data that has some structure to it, some global structure, some underlying explanation like a function. Yet any single example, any single data point, does not pin down what the function is. ⬆

The Biased Coin Task

Owain: In fact, for one of our tasks, the underlying structure is a biased coin. Say you have a coin that comes up heads more often than tails. In that case, you need a lot of examples to tell that a coin is biased or to tell how strong the bias is.

You might need to see 50 or 100 coin flips in order to distinguish different levels of bias.

Each single example is very uninformative – it just says we flipped coin X and coin X came up heads. That’s all you get from a single example. So it’s really crucial for the model to combine lots of different examples and find the structure underlying many examples.

I think it is surprising that models are able to do this. I think this would be hard for humans. If every day you just saw a coin flip right and then after a hundred days you’re asking me like is the coin bias towards you know is it 70 heads or 80 heads and like i think you just wouldn’t be able to do it like you wouldn’t be able to like combine the information in that way uh to get like a precise estimate so yeah i think this is just like a difficult task and it’s interesting that gradient descent is able to you know find these kind of solutions um

Michaël: If you do 100 examples of a coin toss, then add the model, it might be able to do the average. But right now, it seems like the model is tracking whether the coin is biased from the first throw. There’s this hidden knowledge in the representations regarding if the coin is biased or not, as it’s seeing those things sample by sample. ⬆

Will Out-Of-Context Reasoning Continue To Scale?

Michaël: Some experiments you run are about scaling. I think GPT-4 gets better results on this task than GPT-3.5. Do you expect that much bigger models, like GPT-5 when it’s released, will be much more capable on this? I think the scaling is not that dramatic. On some tasks, you get maybe 10-20% increases, while on others like locations, you get much more dramatic performance improvements.

Owain: We tried GPT-3.5 Turbo and the original GPT-4, not GPT-4 Turbo. We get significant improvements in reliability with GPT-4. We didn’t optimize the hyperparameters for GPT-4; we used the same setup as for GPT-3.5, and it performed better out of the box. Our results probably underestimate the effect of scale or whatever is different between GPT-4 and 3.5.

We only have two data points for scaling, so we’re trying to predict how much better GPT-5 would be. We need more scaling experiments with current models. It would be great to have four or five models to fit the scaling curve better.

Michaël: Can we do this with LLaMA models of different sizes?

Owain: The challenge is that GPT-3.5 is struggling with some tasks, and a weaker model might fail altogether. Ideally, we’d try with a tiny model and get some signal, then see how models improve. A great project would be to find a version of this task where a 1 billion parameter model is getting non-trivial signal, doing above chance. Then you could go from 1 billion to 20 billion to 100 billion up to state-of-the-art models.

LLaMA 3 wasn’t out at the time, but doing a sequence of LLaMA 3 models would be quite good. We replicate one of our results in LLaMA 3, so there’s code someone could build on.

Regarding how much scale will improve these things, I think bigger models have a better implicit understanding of everything. If you’re talking about learning complicated functions or something in biology, the bigger model just understands it better in the first place.

We don’t really know why GPT-4 is doing so much better. Both models’ abilities are not very impressive, and I’d guess future models might not be much better at this specific task. ⬆

Checking In-Context Learning Abilities Before Scaling

Michaël: You’re saying there’s something about bigger models having more knowledge about math that can do integrals, so they could be better at guessing functions. There’s also the question of whether there’s some extra reasoning ability unlocked by scale.

It would be interesting to see the difference between 3.5 and 4 in pure in-context learning. If 3.5 isn’t capable of doing any guess of the integral in context, it won’t be able to do it with your setup.

Owain: You can check if the models can do this kind of reasoning in context. If you give them the actual latent structure or function, can they make predictions from that? You could also fine-tune it to do that. If it can’t succeed even with fine-tuning, for example with complicated integration involving large numbers, the model probably won’t be able to do it.

That’s something we’re trying to work on, and it’s really important. Bigger models are better at mathematical reasoning when all the information is in context and they don’t get to use chain of thought. ⬆

Should We Be Worried About The Mixture-Of-Functions Results?

Michaël: One last question is about the mixture of functions. This seems to be one of the main points of your paper. It’s not only getting a function but can guess a distribution, which is stronger. However, the results are not as impressive on the mixture of functions. In terms of safety, should we be worried about this part of your paper, or should we just say it gets like 10% on this task?

Owain: I think the mixture of functions is not that impressive. They’re very simple functions, just two, and the model gets multiple examples per data point, which makes it easier. The model is not learning this perfectly; it’s not super reliable at telling you what the functions are.

The mixture of functions is not intended to be scary or impressive. It’s more about ruling out some theories about what’s going on. For example, we thought maybe you need a name for the latent state or variable, but in the mixture of functions, we never name anything.

I don’t think you should be worried about the mixture of functions. The abilities here, in general, are not that impressive. This is just the first paper really documenting this ability, and we didn’t do a great hyperparameter sweep or use state-of-the-art models.

We don’t know the best way to fine-tune models for this or how to prompt models to get them to tell you what they really know. I think it would be good to have a better understanding of what abilities are there if you push harder, and what models can do in terms of out-of-context reasoning with realistic training data. ⬆

Could Models Infer New Architctures From ArXiv With Out-Of-Context Reasoning?

Michaël: In the paper, you mentioned something about the pre-training data being potentially much more diverse but also less structured. In your setup, it’s more structured but less diverse. There’s a trade-off there.

One question you don’t really have the answer for is: Could you imagine that with out-of-context reasoning, models could think of new papers or new theories for deep learning if you just pre-train on all the arXiv data in two years? Do you think this kind of ability would be possible?

Owain: That’s going a lot beyond what we have in this paper. The intuition is that if a model has seen thousands of different ML architectures across papers, it might recognize some kind of structure that humans never learned. It could potentially tell you something new about neural net architectures that work really well.

More generally, this is about doing science. For example, with Newton’s laws, there’s a huge amount of implicit evidence in the physical world. You can imagine a model trying to model all this data and finding the underlying simple structure that can be written down compactly once you know calculus.

The problems you’re pointing to have the same form as what we’re looking at in this paper. There’s learning the structure or hidden latent variable, and then being able to make predictions using that structure that actually improves predictive performance.

Even if they understood some structure, they might not be able to use it. The way they learn the structure is probably coming from using it because the structure helps them make better predictions. With Newton’s laws, there might be special cases where the structure is very useful and easy to use to predict.

Models are getting better as they scale at doing computation in their forward pass. Claude 3.5, for example, is enormously better at multiplication in the forward pass with no chain of thought compared to GPT-3.

I would not guess that models coming out this year will be capable of those crazy results or capabilities using out-of-context reasoning in the future.

Twitter Questions

Michaël: This makes me think about meta questions people had on Twitter about the impact of your work on alignment research and potential downsides around capabilities. ⬆

How Does Owain Come Up With Ideas

Michaël: How do you come up with these ideas for your research?

Owain: First, I want to give credit to my collaborators. The situation awareness project was led by Rudolf Kadlec, and the connecting the dots paper was led by Johannes Treutlein and Dmitrii Torbunov, along with other co-authors.

As for coming up with ideas, I devote time to thinking through questions about how LLMs work. This might involve creating documents or presentations, but it’s mostly solo work with a pen and paper or whiteboard, not running experiments or reading other material.

Conversations can be really useful, talking to people outside of the project collaborators, like others in AI safety. This can trigger new ideas.

Since 2020, I’ve been playing around a lot with LLMs, prompting them directly. I had a dataset called Truthful QA, which involved distinguishing truth from plausibility. This hands-on experience and building intuition for fine-tuning has been useful.

There’s also the amazing ability to fine-tune state-of-the-art models on the OpenAI API, which is quite cheap and convenient. It allows for a lot of iteration. Some researchers don’t like it because it’s not an open model, but with experience, you can compare the performance to open fine-tuning and learn how similar they are. Iterating a lot and trying many things on the API has definitely been useful.

Michaël: Right. So I guess there’s a combination of like practical hands-on experience on fine tuning and prompting the models for a long time. And doing what is like short iteration speeds, cheap. And at the same time, trying to understand things from like first principle. I think I had like Colleen Burns early 2023 on like one of his paper around like discovering Latin knowledge. And he explained that his process was like whiteboard. Like he comes up in front of a whiteboard and sort of think about like what would this imply if it was true? Like what if it’s not? And just like really like thinking from first principle. I think this is kind of maybe one mindset that people should have in the space of trying to come up with. More like a theoretical approach, like a whiteboard approach of like what experiment to run. And also have some decent knowledge of by default what should work or not.

Owain: Yeah, I think probably both of these are important. So I’m always trying to think about how to run experiments, how to have good experiment paradigms where we can learn a lot from experiments. Because I do think that part is just super important. And then, yeah. Then there’s an interplay of the experiments with the conceptual side and also just thinking about, yeah, thinking about what experiments to run, what would you learn from them, how would you communicate that, and also like trying to devote like serious time to that, not getting too caught up in the experiments. So, yeah, I think for me, both are important. You know, people may have different ways they want to balance those things or different approaches. ⬆

How Owain’s Background Influenced His Research Style And Taste

Michaël Yeah, more generally, if we take a step back, like How would you define your research style and taste? Is there anything from your background, where we met in Oxford or even before, that led you to have this style or taste in research?

Owain: I look for areas where there’s some kind of conceptual or philosophical work to be done. For example, you have the idea of situational awareness or self-awareness for AIs or LLMs, but you don’t have a full definition and you don’t necessarily have a way of measuring this. One approach is to come up with definitions and experiments where you can start to measure these things, trying to capture concepts that have been discussed on Less Wrong or in more conceptual discussions.

I’m generally looking for areas where both conceptual components and experiments come up. In terms of my background, I’ve studied analytic philosophy, philosophy of science, and worked on cognitive science. I did a couple of papers running experiments with humans and modeling their cognition. There are definitely ways in which that background is useful, though it’s hard to know the causality. Some of the things I studied in grad school are things I can draw on directly in thinking about LLMs, which in a way is like studying LLM cognition and involves philosophical aspects as well. ⬆

Should AI Alignment Researchers Aim For Publication

Michaël: Do you think people should aim for making papers that get published in conferences and present AI safety work to more ML people? Or should they work on more creative theory around alignment? Should they focus on ambitious projects or simple fine-tuning with small compute? Do you have any advice on that?

Owain: People should, to some extent, play to their strengths. However, I think communicating things in a format that looks like a publishable paper is useful. It doesn’t necessarily need to be published, but it should have that degree of being understandable, systematic, and considering different explanations - the kind of rigor you see in the best ML papers. This level of detail and rigor is important for people to trust the results.

Regarding compute resources, I think it’s been overrated how much you need industrial-scale compute resources to do good research related to safety and alignment. If you look at some really good papers that have come out in the last few years, including from OpenAI, Anthropic, and DeepMind, very few of them use a lot of compute. Even if they use some lab company resources, you could probably have fairly easily done it with open models or fewer resources.

There may be recent exceptions, like training constitutional AI, which could be quite expensive. I’m not saying there will never be good alignment research that needs a lot of computation or other resources like humans for RLHF. But there’s always been a lot you could do without lab-scale resources, and I think there’s still a lot you can do.

People shouldn’t feel inhibited by not having those resources. You can fine-tune GPT-4 on the API, and it’s an amazing model. The setup is incredibly convenient and not that expensive. There are also very strong open-source models available now. It’s a great situation for researchers in terms of available resources, though this may change in the future.

Michaël: Now you don’t have any excuse. You have all the resources, all the LLaMA models, and cheap fine-tuning.

Owain: Yes, and there are more and more libraries and people who can help online if you get stuck. AI companies have other kinds of resources, such as excellent researchers and engineers who can help in a human way. In some cases, they might help with creating a big dataset. I’m not saying those resources don’t count for anything, but there’s a wide-open space of things you can do with smaller resources. ⬆

How Can We Apply LLM Understanding To Mitigate Deceptive Alignment?

Michaël: One thing people want to do is decrease the probability of deceptive alignment once we know that models are situationally aware. In your work, you mostly measure things like situational awareness. Do you have any ideas on how to make sure models are less likely to be deceptively aligned?

Owain: I don’t have particular novel ideas. Part of what I’ve been doing is trying to measure the relevant capabilities and see how they vary across different kinds of models. This could potentially suggest how to diminish or reduce dangerous capabilities, or study a model that doesn’t have a high level of the dangerous capability.

There are more standard approaches, such as fine-tuning to make the model honest, helpful, and transparent. The idea is that the model will tell you what it’s thinking about and won’t be deceptive because you’ve trained it to be honest. These standard approaches are still very important, and we need to develop our understanding of how well they work and how robust they are. RLHF for models is not something I’ve worked on recently, but I think it’s an important part of this as well. ⬆

Could Owain’s Research Accelerate Capabilities?

Michaël: There’s a question from Max Kaufmann about whether some of your results, like on the reversal recurse, could lead to potential capabilities improvements. Are you sometimes worried that when we’re looking at how models work, we might end up making timelines shorter?

Owain: Yeah, I’ve definitely thought about that. I’ve consulted with various people in the field to get their takes. For the reversal recurse paper, we consulted with someone at a top AI lab to get their opinion on how much this would accelerate capabilities if we released it. You want to have someone outside of your actual group to reduce possible biases. This isn’t specific to the kind of work I’m doing. Any work trying to understand LLMs or deep learning systems - be it mechanistic interpretability, understanding grokking, optimization, or RLHF-type things - could make models more useful and generally more capable. So improvements in these areas might speed up the process.

Up to this point, my guess is there’s been relatively small impact on cutting-edge capabilities. There’s been progress in mechanistic interpretability, but I’m not sure it has really moved the needle that much on fundamental capabilities.

You can improve capabilities without understanding why or how things work. You can use a different non-linearity or gating, and that can improve things. You don’t know why, but you did a big sweep and this architecture just works a bit better.

Owain: How do I think about this overall? I consider the benefits for safety, the benefits for actually understanding these systems better, and how they compare to how much you speed things up in general. Up to this point, I think it’s been a reasonable trade-off. The benefits of understanding the system better and some marginal improvement in usefulness of the systems ends up being a win for safety, so it’s worth publishing these things.

If you don’t publish these things, it’s very hard for them to help with safety. The help might be pretty small. It might be different if you’re at a major lab and can share internally but not with the wider world. But even there, the best way to communicate things is to put them out there on the internet so people can have them at their fingertips.

Michaël: There are two levels to what you’re saying. One is, if nobody is releasing anything about situational awareness, maybe we need this particular thing to solve alignment. In that case, we need to push capabilities forward a little bit anyway. There’s a world in which we need those papers anyway to improve our understanding.

The other part is about being at a top AI lab. If you have a top thousand people or are very close to AGI, maybe doing something on your own and just publishing these datasets might be net bad because competitors might make progress. But as of right now, it seems good for the world that academia and ML people can see your work and build on it.

Owain: Yes, that makes sense. Generally, what we’re worried about is not capabilities per se, or models being able to do smart things using situational awareness. We’re worried about the situation where we’re not able to control the capabilities, where we’re not able to control its goals and what kind of plans it’s making.

We want to develop and understand these capabilities in detail so that we can control them. With that better understanding, there may be ways to marginally improve the usefulness of the model in the near term. Most ways in which we come to better understand models that enable us to control them better are going to come with enabling you to make them more useful in some way.

It’s worth pointing out there are ways that we can improve the usefulness of the model in the near term without understanding why we’ve made them more useful. Scaling is a bit like this - you just train the model for longer and make it bigger, and the model gets more powerful. This didn’t come with any real insight into what’s going on.

Michaël: I guess there’s a basic counter-argument that some capabilities you can only see at the scale where we are right now. Maybe right now we’re at a point where we can do most of our alignment work at this scale and don’t need to scale further. But maybe for some complex reasoning thing, if you want to align our models on some complex planning or agenting behavior, maybe we need to study them and play with them at a higher scale.

Maybe other countries or people will scale models anyway, so in any case, the top AI labs need to scale them further to study or align them. It’s a very complex problem. I guess if you’re just scaling models without having any alignment lab, it’s not bad. ⬆

How Was Owain’s Work Been Received at AI Labs and in Academia

Michaël: I had one final question on the reception of your work. This is quite a novel way of studying LLMs. I’ve seen some people commenting on it on Twitter, but overall, what would you say is the reception of your work from academia and ML? How did they react to it?

Owain: Well, the papers I’ve talked about here, “Me, Myself, and AI” and “Connecting the Dots,” are quite recent, so there hasn’t been much reaction yet. For situation awareness benchmarking, there’s interest from AI labs and AI safety institutes. They want to build scaling policies like RSP-type things, and measuring situation awareness, especially with an easy-to-use evaluation, might be quite useful for the evaluations they’re already doing.

Generally, those people at AI labs and safety institutes who are thinking about evaluation have been interested in this work, and we’ve been discussing it with some of those groups.

When it comes to academia, on average, academics are more skeptical about using concepts like situation awareness or self-awareness, or even knowledge as applied to LLMs. They tend to be more skeptical, thinking that this is maybe overhyped in some way. So in that sense, people might be less on board with this being a good thing to measure because they maybe think this is something that LLMs aren’t really able to understand and do in a serious way.

But I think this might change as models are clearly becoming more agentic and the LLM agent thing maybe takes off more in the next few years. Overall, it’s fairly early to tell, and it’s kind of unpredictable. Some works that I’ve been involved in have had more follow-ups and more people building on them. TruthfulQA is probably the one that gets the most follow-ups of people using it. It’s just kind of hard to predict in advance.

Michaël: The idea behind this was if most of the impact would come from people at top AI labs using the work versus people in academia using it. But from what you’re saying, it would be more in academia.

Owain: There are some papers on out-of-context reasoning that may have been influenced by our work last year. Certainly, in total, there’s been more work coming from academia on out-of-context reasoning. Academics publish more stuff and publish earlier, so there’s work that may happen inside AI labs that doesn’t get published. We’ll see what comes out in the next year. ⬆

Last Message to the Audience

Michaël: Stay tuned next year for whether models start having crazy capabilities. Check out the two papers from Owain Evans and collaborators - he’s just a senior author. “Me, Myself, and AI” and “Connecting the Dots” are two of them. Do you have any last message for the audience?

Owain: I’m supervising people via MATS (ML Alignment Theory Scholars program). That’s one way you can apply to work with me. Most of the projects I’ve done in the last couple of years have involved people from MATS, so that’s been a great program. I’m also hiring people in other contexts for internships or as research scientists. If you’re interested, send me an email with your CV. You can follow me on Twitter for updates. I’m not on there very much, but if I have any new research, it will always be on Twitter.

Michaël: I highly recommend the Twitter threads that you write for each paper. It’s a good way to get the core of it. I made a video about AI lie detection that I also recommend people watch. Most people said you can just look at the memes you create or the core threads you write, and you get maybe 50% of the paper from just the thread. So follow Owain Evans on Twitter.

This was the end of this, and maybe we’ll see more of that in the next few years. I’ll see you in the next few years.

Owain: Great, it’s been fun. Really interesting questions, and thanks a lot for having me.