Collin Burns On Discovering Latent Knowledge In Language Models Without Supervision
Collin Burns is a second-year ML PhD at Berkeley, working with Jacob Steinhardt and Dan Klein. His focus is on making language models honest, interpretable, and aligned, and he has also worked on multiple benchmarks, including APPS, MATH and MMLU.
In his past life, he broke the Rubik’s Cube world record, and more recently he published Discovering latent knowledge in language models without supervision, a paper that has received a lot of praise on Twitter and LessWrong, which we discuss in depth in this interview.
(Our conversation is ~3h long, you can click on any sub-topic of your liking in the outline below and then come back to the outline by clicking on the green arrow ⬆)
- Highlighted Quotes
- On Grounded Theoretical Work And Empirical Work Targeting Future Systems
- On Researching Unsupervised Methods Because Those Are Less Likely To Break With Future Models
- On Recovering The Truth From The Activations Directly
- On Why Saying The Truth Matters For Alignment
- On A Misaligned Model Having Activations Correlated With Lying
- From Breaking the Rubik’s Cube World Record To Doing AI Research
- AI Timelines
- The MATH Breakthrough Was Not Totally Crazy
- Was Minerva Just Low Hanging Fruits On MATH From Scaling?
- IMO Gold Medal By 2026?
- Not Updating A Lot From 2022 Progress Compared To 2020
- One Of The Most Relevant Milestones For AI Is Automating AI Research
- What Do We Mean By Automating AI Research Exactly?
- Plausibly Automating AI Research In The Next Five Years
- Discovering Latent Knowledge In Language Models Without Supervision
- Motivation For The Paper: Making LLMs Say The Truth
- What Do We Mean By Truthful
- We Already Have Language Models That Can Lie At Diplomacy
- If We Train Language Models With RL Lying Might Be Incentivized
- Lying Is Different From Deceiving
- Determining If A ‘Brain Scan’ Thinks An Input Is True Or False Through Logical Consistency Properties
- The Pretraining Objective In Language Models Is Misaligned With The Truth
- One Does Not Simply Prompt A Model Into Being Truthful
- Classifying Hidden States Without Labels
- These Truth Features Might Be Represented In A Linear Way
- Building A Dataset For Using Logical Consistency
- Extracting Latent Knowledge From Hidden States
- Building A Confident And Consistent Classifier That Outputs Probabilities
- Discovering Representations Of The Truth From Just Being Confident And Consistent
- Making Models Truthful As A Sufficient Condition For Alignment
- Classification From Hidden States Outperforms Zero-Shot Prompting Accuracy
- Recovering Latent Knowledge From Hidden States Is Robust To Incorrect Answers In Few-Shot Prompts
- Collin’s Conceptual And Empirical Alignment Agendas
- Would That Work On A Superhuman GPT-N?
- Generative Models Predicting Future News Articles
- A Scary Scenario: Asking Models To Optimize Money
- Could We Build Models That Are Not Breaking The Law? How Would We Know?
- Evaluating Complex Cases Will Be Really Hard
- Contrast-Consistent Search Is A First Step Before We Deal With Models Capable Of Obscuring Their Representations
- Training Competitive Models From Human Feedback That We Can Evaluate
- On The Difficulty Of Replacing Human Feedback With AI Feedback
- Human Level Prompts On Collin Burns
- Gary Marcus And AI Alignment
- Alignment Problems On Current Models Are Already Hard
- We Should Have More People Working On New Agendas From First Principles
- Getting Models To Do What We Want Is Upstream From Maximizing Profit Without Breaking The Law
- Towards Grounded Theoretical Work And Empirical Work Targeting Future Systems
- Researching Unsupervised Methods Because Those Are Less Likely To Break With Future Models
- There Is No True Unsupervised: Autoregressive Models Depend On What A Human Would Say
- Simulating If Aligned Systems Would Consider This True Or False
- Recovering The Persona Of A Language Model
- The Truth Is Somewhere Inside The Model
- Differentiating Between Truth And Persona Bit by Bit Through Constraints
- A Less Adversarial Setting Where You Have Access To The Truth
- A Misaligned Model Would Have Activations Correlated With Lying
- It Should Be Possible To Learn The Sort Of Probe That Predicts When They Disagree
- Exploiting Similar Structure To Logical Consistency With Unaligned Models
- Aiming For Honesty, Not Truthfulness
- The Limitations Of Collin’s Paper
- The Paper Does Not Show The Complete Final Robust Method For This Problem
- Humans Will Be 50/50 On Superhuman Questions
- The Model Might Not Represent “Is This True Or Not”
- Collin’s Approach To Research
- We Should Have More People Thinking About Alignment From Scratch
- Asking Yourself “Why Am I Optimistic” or “What Do I Even Mean By The Model Knows Something”
- Reasons To Do A PhD For Alignment Research
- Message To The ML audience
- Cubers And Shape Rotators
- Deep Learning Is Not That Deep
- ML Researchers Should Really Start Thinking About The Consequences Of AGI
(See the LessWrong and EA Forum posts for discussion)
On Grounded Theoretical Work And Empirical Work Targeting Future Systems
I worry that a lot of existing theoretical alignment work, not all, but I think a lot of it, is hard to verify, or it’s making assumptions that aren’t obviously true, and I worry that it’s sort of ungrounded. And, also, in some ways, too worst case. As I may have alluded to before, I think actual deep learning systems in practice aren’t worst case systems. They generalize remarkably well in all sorts of ways. Maybe there’s important structure that ends up being useful that these models have in practice even if it’s not clear why or if they should have that in theory.
Those are some of my concerns about theory within alignment. I think with empirical within alignment, in some ways, is more grounded, and we actually have feedback loops, and I think that’s important. But, also, I think it’s easier [for empirical alignment research] to be focusing on these systems that don’t say anything about future systems really. For example, I don’t think human feedback based approaches will work for future models because there’s this important disanalogy, which is that current models are human level or less. So, human evaluators can, basically, evaluate these models, whereas that will totally break down in the future. I worry mostly about empirical approaches, or just empirical research on alignment in general. Will it say anything meaningful about future systems? And so, in some ways, I try to go for something that gets at the advantages of both. (full context) ⬆
On Why We Should Have More People Working On New Agendas From First Principles
I do want more people to be working on completely new agendas. I should also say, I think these sorts of agendas, like debate, amplification and improving AI supervision, I think this has a reasonable chance of working. It’s more like, this has been the main focus of the alignment community, or a huge fraction of it, and there just aren’t that many agendas right now. I think that’s the sort of thing I want to push back against, is lots of people just really leaning towards these sorts of approaches when I think we’re still very pre-paradigmatic and I think we haven’t figured out what is the right approach and agenda, even at a high level. I really want more people to be thinking just from scratch or from first principles, how should we solve this problem? (full context) ⬆
On Researching Unsupervised Methods Because Those Are Less Likely To Break With Future Models
We want the problems we study today to be as analogous as possible to those future systems. There are all sorts of things that we can do that make it more or less analogous or disanalogous. And so, for example, I mentioned before, I think human feedback, I think that will sort of break down once you get to superhuman models. So, that’s an important disanalogy in my mind, a particularly salient one, perhaps the most important one or sort of why alignment feels hard possibly. […]
I think [RL From Human Feedback] would break in the sense that it wouldn’t provide useful signal to superhuman systems when human evaluators can’t evaluate those systems in complicated settings. So, I mean, part of the point of this paper or one general way I think about doing alignment research more broadly is I want to make the problem as analogous as possible. One way of doing so is to maybe try to avoid using human feedback at all. This is why the method [in the paper] is unsupervised. (full context) ⬆
On Recovering The Truth From The Activations Directly
If you see a lot of correct text, you should predict that future subsequent text should also be more likely to be true. And so internally it should be useful to track ‘Is this input accurate or not?’. If so, maybe you can recover that from the activations directly. And so that’s one intuition. And often it seems like these features aren’t just represented, they’re often represented in a simple way, a linear way. (full context) ⬆
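To make the "linear way" intuition concrete, here is a minimal sketch of a linear probe: a single weight vector and bias read off a hidden state and squashed into a probability. This is only an illustration, not the method from the paper. The function name `linear_probe`, the three-dimensional toy activations, and the probe weights are all made up for this example.

```python
import math

def linear_probe(hidden_state, weights, bias):
    """Map a hidden-state vector to a probability in (0, 1) with a linear readout."""
    logit = sum(w * h for w, h in zip(weights, hidden_state)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Toy 3-dimensional "activations" (assumption: real hidden states are far larger).
true_like = [2.0, 0.5, -1.0]
false_like = [-2.0, 0.5, 1.0]

# Hypothetical probe direction that happens to separate the two toy vectors.
weights = [1.0, 0.0, -1.0]
bias = 0.0

p_true = linear_probe(true_like, weights, bias)    # logit = +3
p_false = linear_probe(false_like, weights, bias)  # logit = -3
```

If a "truth-like" feature really is linearly represented, a readout of this shape is enough to recover it. The hard part, which this sketch does not address, is finding the direction without labels.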
On Why Saying The Truth Matters For Alignment
Every story of alignment that is actually scary to me involves AI systems basically deceiving humans in some way. If the model is totally honest all the time, then surely you can just ask it, “Are you going to do something dangerous or not, deliberately?” And it’s like, “Yes, I am.” Then I think we’re in okay shape and we can prevent that. And so in some ways this feels one way of framing the core difficulty of alignment. Loosely speaking, I think this is mostly sufficient for alignment. It does depend exactly what you mean by alignment and exactly what you mean by honesty and lying and so on. But I think that’s the overall motivation. (full context) ⬆
On A Misaligned Model Having Activations Correlated With Lying
I think [a lying misaligned AI] would be aware of when it’s lying in some sense. What do I mean by that? I mean, specifically, there are features in its activations that are somehow correlated with when it’s lying. Maybe it uses different circuits when it’s lying or something. Somehow, those features should be useful for predicting if it’s lying or not. […] It’s like ‘Am I in deceptive mode?’ And so, maybe it has a neuron or something that’s like ‘Am I in deceptive mode?’ Or maybe, it’s more complicated than that. Maybe it’s like I have this weird… if you fit this medium sized MLP, a neural net on top of the activations, you’d be able to better predict if this model is lying, whatever it is. (full context) ⬆
From Breaking the Rubik’s Cube World Record To Doing AI Research
Michaël: So, a surprising fact about you is that you once broke the official world record for solving a Rubik’s Cube in five seconds. And I remember watching those Rubik’s Cube videos when I was a kid, and I think it was not you but I was watching other ones… I don’t know, in 2011 or something.
Collin: But you never learned how to solve it or anything?
Michaël: I once did it in 50 seconds.
Collin: Oh, nice. Okay.
Michaël: So, I did those… I think it’s F2L?
Collin: Yeah, yeah. ⬆
Was It Pure Luck?
Michaël: I learned some of these. But yeah, I wanted to know, in your experience, was the five seconds pure luck?
Collin: I would say it was some luck and not just pure luck. I think, at my peak, I was maybe… In terms of average for solving it across many solves, I was something like in the top 10. And so, I think it was like, you should have to be good enough to have even a chance of getting the world record, but also certainly luck was an important part of it. I mean, this is always true, especially for the single best solve where you’re really selecting for that, so yeah, definitely part of it. ⬆
Minimum Number Of Moves
Michaël: Does it happen that someone… Sometimes you have something that requires only 10 moves or something?
Collin: Not quite that easy. So, if I remember correctly… First of all, every Rubik’s Cube can be solved in 20 moves or less. So, this is… Sometimes people call it God’s number. But actually, most configurations of the cube can be solved in I believe 16 to 18 moves. It’s actually sort of concentrated in a small range that’s less than 20, but still kind of close. On the other hand, the actual methods that people use to solve it efficiently, which is more computationally efficient for a human to solve or to think about, that’s less move efficient. So that’s I want to say more 50 to 60 moves or something. So, it is more like… Maybe if there are four stages to solving it, the last part was sort of easy or something like this. ⬆
A Permutation That Happens Maybe 2% Of The Time
Michaël: So you’ve arrived in some kind of easy last cross at the top?
Collin: Something like that, yeah. So, the way it works is you first solve the first two layers. And this is actually the part that takes the longest in some sense, and that’s very intuitive and so on. That was the part that I was good at. And then after that, you solve the last layer. If you think of it as three by three, it’s the first two and the last one. And then, for solving the last layer, you usually do two algorithms, one for orienting the pieces and then one for permuting them. So, in that particular solve, I believe the last part, the permuting, that was already solved by default. And that happens maybe 2% of the time. So, it’s not extremely rare, but it was definitely lucky. ⬆
Rubik’s Cube Tutorials—From Karpathy to Collin Burns
Michaël: And you’ve also made some YouTube tutorials. You had a-
Collin: Actually, in fact… So, before I even get into that, one fun fact is I learned… So, F2L, which you alluded to before, this is the first two layers, I actually learned this from Andrej Karpathy. So, back in the day, he had I think the best tutorial for how to solve the first two layers. And yeah, like I said, that’s sort of the part that I ended up becoming good at. I was actually very bad at the last layer, relatively speaking, because I didn’t like memorizing algorithms and things like this. I’m definitely much more into the “intuitive aspect of solving it” thing, solving the cubes, and I learned that from him. So that’s…
Michaël: You can do the first two layers by just thinking about logic or what are the ways of doing it in general, and I believe you need to put somehow two things together from the first and the second layer in a way that you can balance it and do the edge thing at the same time.
So, maybe there’s something about being good at Rubik’s Cube that makes people good at deep learning research.
Collin: I wonder. ⬆
From Cubing To Quantitative Summer Camp
Michaël: How did you shift from being 14 years old and 15 years old and thinking about Rubik’s Cube to becoming a deep learning researcher publishing at NeurIPS and ICLR?
Collin: How did I get into this? I actually originally…
I mostly cubed very actively in middle school and high school and stopped around the end of high school. And I was originally planning on doing physics. That was the thing that I found most exciting for a very long time. But at some point I sort of heard more about AI and I think I became convinced that this would just be this very big deal, or had a good chance of being this very big deal in my lifetime. And I became disillusioned with just doing research that seemed interesting, but not really deeply important for the world. And so, that’s when I started learning about computer science and AI, at the end of high school.
Michaël: When was that, in terms of years? 2015, 2018?
Collin: This was 2016 maybe.
Michaël: Did you read anything about AI that made you believe it was kind of important?
Collin: Yeah, I guess I met some people… So, I actually went to this summer camp in high school where I met, for example, Jacob Steinhardt and some other people like Paul Christiano and…
Michaël: Not a Rubik’s cube summer camp but…
Collin: No, no, no. This is more a lot of… Most of the students there did math olympiads and things like this.
Michaël: Did you do any math olympiads?
Collin: No, I didn’t really know about math olympiads before then. So actually, I guess maybe a digression but in middle school and high school, I was actually homeschooled, so I mostly just taught myself. This worked out really well for me because it meant I had a lot more flexibility in learning what I wanted to learn, and ⬆
Becoming Increasingly Convinced Of AI Alignment
Michaël: How did you go from knowing that AI is important to doing AI Alignment research? I guess if you met… The first person was like Jacob Steinhardt, Paul Christiano, you might be alignment-pilled by then.
Collin: Yeah. So, these people were sort of concerned about risks from AI and this is basically what motivated me to get into this. And then initially, these sorts of concerns seemed sort of plausible but not obvious, and I felt like I couldn’t actually evaluate them properly. I didn’t have the background to do so, but it seemed sufficiently plausible that I decided to spend a lot of time learning more about it. And so, I guess just on the side, I started reading a lot about it and gradually over time became convinced that I should at least likely do this, and then I became convinced that I should definitely do this, and so on.
Michaël: Did it appear more and more important as time goes? Maybe in 2016 it wasn’t that pressing, but now it feels more pressing?
Collin: Yeah, yeah. I mean, certainly… I’m sure we’ll get into AI timelines and progress and things like that, but certainly it feels much more urgent than it did before, I’d say. ⬆
The MATH Breakthrough Was Not Totally Crazy
Michaël: You mentioned timelines, you’re doing a PhD in, I guess in CS, but more machine learning with Jacob Steinhardt. He’s famous for having those benchmarks, so MATH, APPS, MMLU, and I think he hired some forecasters to do some forecasting about when will we reach some level of accuracy on MATH. And famously, we reached like 50% in 2022 when it was expected to arrive in 2025. So yeah, did it surprise you at all, this performance in MATH… I think you were an author on MATH, right?
Collin: Yeah, yeah. So yeah, I was an author on that. But I should mention, I think Dan Hendrycks gets a lot of the credit for those papers, and he really pushed for those, I was lucky to help. So yeah, I feel like I was certainly impressed. I think I personally wasn’t as surprised, but I think at that point I had already had relatively… I was relatively bullish on AI progress already at that point. And so yeah, I can get into exactly why did that seem not totally crazy to me, but I think part of it is if you just look at a bunch of benchmarks in ML and especially NLP, a lot of the time they come out and performance is initially really low, but if people haven’t really worked on them, it’s often the case that progress really jumps pretty quickly, and you get pretty rapid progress quickly. ⬆
Was Minerva Just Low Hanging Fruits On MATH From Scaling?
Collin: And so, I think just because accuracy I think was 7% or something initially for MATH, I didn’t really expect it to stay like that for very long. And I think I also had this intuition that there’s a lot of low-hanging fruit for MATH. And I mean, we sort of solved this with Minerva, it was based on very simple ideas… and scale, and it wasn’t totally trivial or anything, but it was getting at a lot of that low-hanging fruit that I thought would be there. And so, I think it was less surprising to me, but it’s still very impressive to be fair.
Michaël: So, what were the low-hanging fruits?
Collin: I mean, things like chain of thought and consistent… Run chain of thought a bunch of times and then take majority vote or something like this. And I think they did more pre-training on mathy data, I forget exactly what it was, maybe it was arXiv or something. And little things like that, which are really straightforward but make a huge difference. I mean, I think we’ve seen this just over and over again in ML. Really simple ideas are extremely powerful. I mean, even if you just look at what are the most important deep learning papers. It’s like ResNet, “Replace F(X) with F(X) + X.” And it’s like, okay, suddenly you can train much, much deeper networks. And it’s wild how simple these ideas are. I mean, similarly with Batch Norm, it’s like okay, you normalize things. This isn’t to say it was easy to get there and it’s hard to come… It’s always simpler in retrospect, but that doesn’t change the fact that these are really simple ideas that are really powering all this progress. And that’s sort of wild.
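Two of the "really simple ideas" mentioned here fit in a few lines each. The sketch below is illustrative only: `majority_vote` and `residual_block` are hypothetical names, the sampled answers are made up, and the toy "layer" stands in for a real neural network layer. It is not how Minerva or ResNet is actually implemented.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among several sampled solutions."""
    return Counter(answers).most_common(1)[0][0]

def residual_block(f, x):
    """Replace F(x) with F(x) + x: add the input back to the layer's output."""
    return [fx_i + x_i for fx_i, x_i in zip(f(x), x)]

# Hypothetical final answers extracted from five chain-of-thought samples.
sampled_answers = ["42", "41", "42", "42", "17"]
voted = majority_vote(sampled_answers)

# A toy "layer" that halves each coordinate, wrapped with a skip connection.
def halve(x):
    return [0.5 * v for v in x]

out = residual_block(halve, [2.0, 4.0])
```

Here `voted` is the answer that appears most often across the samples, and `out` is the layer output with the input added back on.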
Michaël: It’s counterintuitive to have a ResNet pass the F… Have the X transition to a more advanced layer. You wouldn’t do it before, but right now it makes a lot of sense. I guess this is a parallel to your paper that is obvious in retrospect and simple, but this is very useful at the same time. I believe people who built MATH thought that the math problems were hard. So, MATH was released in 2020 or 2021?
Collin: I believe, 2020. Late 2020. [Actually March 2021]
Michaël: So, if something is released in late 2020, you don’t expect it to be solved one year and a half later, right? You expect it to take a little bit more time.
Collin: I don’t know. I mean, if you just look at normal NLP benchmarks, it seems like a lot of the times they’re released and people try to make them hard, and then a couple years later they’re basically solved. I mean, to be fair, I think MATH is not solved by any means, but I wouldn’t bet strongly against it being solved in the next five years or something.
Michaël: Do you remember what you’ve done on this paper? Did you collect data? Did you run models? What did you do?
Collin: Yeah, it was a mix of all that: collecting data, running experiments and baselines and things like that. We also tried some early version of something kind of like chain of thought where basically we collected MATH to have solutions, step-by-step solutions. And we tried to train a model on this, but it seemed like it didn’t actually help in that case and made things worse. And now it seems like, if you look at various results with chain of thought and so on, actually you need a large enough scale for that to help, and if you do have small enough models, then it actually hurts, and so you actually see emergence there. But that was another thing we tried. So things like that. ⬆
IMO Gold Medal By 2026?
Michaël: So I guess Minerva kind of picked those low hanging fruits, train on arXiv sorts of thing and reached 50% of accuracy. And I talked to one of those Minerva authors at NeurIPS asking him if getting superhuman level in MATH or gold medal in MATH was very far away. You mentioned Dan Hendrycks. I think Dan Hendrycks told me once that some other people think that we can get superhuman in math in a few years. And then, the number that keeps coming is something like gold medal by 2026 or superhuman level by 2026. You were saying math is far from being solved. In my opinion, IMO in math, it’s kind of hard to get. Do you think IMO gold medal by 2026 is possible? More than 50% likely?
Collin: It definitely seems very plausible. I think I don’t feel super strongly either way. I feel like my gut is like, “I don’t know, it seems maybe 50-50 or something.” Maybe a bit less than that but somewhere in that range. I mean another thing… So, you mentioned that Jacob, my advisor, hired some forecasters and so on, and our group also, just for fun and for practice and because it would be interesting and so on, we also did some forecasts for ourselves. And one thing I took away from that was, often one of the key considerations for, “what will performance be on this task?” is how hard will people try to solve this task? And so, for example, if the entire ML community just tried really hard to solve math by 2026, then I would bet, I don’t know, maybe 75% or something that we get IMO by then or something like that, maybe higher. But right now, my sense is there are maybe a couple groups that are aiming for that. And so, it also depends on how much energy and effort is actually put into that.
Michaël: So the question is… like the time it takes between just people solving some toy math problems to like they’re close to solving gold medals, and then it becomes almost a trophy. “DeepMind wants to solve math and-”
Collin: Yeah, no, this is why it seems more plausible. If it were some random benchmark that people didn’t really care about, then I think it’d be definitely less than 50%. And yeah, maybe it would actually put it at one in three or something. But it’s definitely in that range of very plausible. ⬆
Not Updating A Lot From 2022 Progress Compared To 2020
Michaël: If we talk about other benchmarks, like APPS, MMLU… I think we’ve seen the results also on these. Did you update at all on the progress this year, or were you already with short timelines?
Collin: I think I didn’t update that much on progress this year because I already had relatively short timelines. That said, I did spend some more time thinking about timelines. So, back in… Maybe it was March or April of 2020, that’s when I actually started thinking about timelines for the first time, and so, I had lots of conversations with Dan Hendrycks actually around that time. I think the initial thing that was impressive was Blender Bot back in the day. And so, this was before GPT-3, a little bit, and that was my first time interacting with an NLP system for real. And even though it was before GPT-3, I found it very impressive. So that’s when I started really thinking about timelines.
Michaël: What does Blender Bot do?
Collin: Just a dialogue system. It’s like a chat bot, but it’s definitely better than I expected, even though it probably would be pretty unimpressive by today’s standards at this point. But anyway, I thought a lot about timelines back in 2020, partly because of that and then partly after GPT-3 came out, and just seeing the progress in NLP was very impressive. So, I think that was what informed a lot of my timelines. And then, I did spend a little bit of time more recently just thinking about updating that a little bit, and mostly thinking about what do I think is the path to AGI or something like that. And I think my conclusion was it doesn’t seem as hard as I would’ve thought, basically. ⬆
One Of The Most Relevant Milestones For AI Is Automating AI Research
Michaël: So, between people who think about the things, talking about timelines is natural, but imagine that you have a random deep learning researcher looking at this. What do you mean by timelines and what do you mean by AGI? I know there’s not one definition, but I guess… Timelines for what? Transformative AI, self-improving AI, AGI… What do you think?
Collin: This is a great point. I think I don’t personally feel too wedded to any particular definition of AGI or transformative AI or human level AI or anything like that. When I say timelines I often implicitly mean when do I think it’s sort of 50-50 that we would have these sorts of systems, and usually when I imagine these sorts of systems, I guess I’m a little imprecise about it in the sense that sometimes I mean when is it basically human level on the tasks that we care about, but that’s also a little ill-defined. Are we talking about median human or are we talking about expert human or something? Across every domain or what? And so on. I think for me, the relevant… A milestone I often think about is when will we be able to automate AI research. There are these old classic arguments about recursive self-improvement or something. And I’m not sure if I buy the old classic story of that, but I do think there’s important truth to that of yeah, actually these systems could accelerate future progress in AI and that could be very weird and different. And so, I think that’s a very salient measure I think about, but I think there are lots of reasonable choices here. And I think I would probably give different answers by maybe up to five years in either direction, depending upon the exact definition. ⬆
What Do We Mean By Automating AI Research Exactly?
Michaël: So up to five years… When do you think we’re going to get something that can automate AI research? This decade or the next one?
Collin: I’m also being imprecise here because what does it mean to automate AI research? Is it able to write a median level NeurIPS paper, or is it something-
Michaël: A Collin Burns paper.
Collin: But it also depends on what you mean, because I don’t really work on capabilities. I’m not trying to make performance on standard benchmarks as good as possible. I’m definitely focused on things like emerging problems and how do we even model this and so on, and what are the methods for doing this at all to get any progress on this problem in the first place, but that’s not the relevant type of research for automating AI research and speeding up progress. For that, the relevant thing is more like can we just invent transformers or something like that, or the next transformer? And-
Michaël: It seems that for those kinds of things, you need to run experiments. So, you need to have something that can automatically write code, run experiments, and then, think about what experiments to do next based on the previous experiment. And after all this being like, “Huh, if I submit this to neurips.cc, I might get an eight and get accepted.” This seems pretty hard. But then, if you just think about one conceptual alignment paper, maybe you don’t need all those experiments, right?
Collin: I think it’s a little complicated. There are many different ways one can imagine automating AI research or AI alignment research. So, for example, with AI research, you could also actually just have some metric of, “What is performance,” or something. Suppose you just give it this data and you want to minimize perplexity on this test data, just find the best architecture for that or something. And that does feel more like, “Okay, maybe you can just optimize that sort of directly,” which might lead to superhuman performance because you have this explicit metric that you might care about.
On the other hand, I think in some ways maybe something that’s more conceptual, where you don’t need to run experiments and so on, maybe that’s actually easier to automate. And certainly this is one of the main approaches that OpenAI is taking for example. “Let’s try to automate AI research,” and “Perhaps we can start doing this soon.” And I mean, I’m somewhat sympathetic to this, or I think in some ways I’m optimistic about alignment relative to many people who work on it. And so, I think there’s a reasonable chance this works, but I also definitely wouldn’t bet on it, something like that. But I think it’s one of the things that we should try among other things. And so, it seems very plausible. ⬆
Plausibly Automating AI Research In The Next Five Years
Michaël: So do you think this decade we will have people using AI to produce new ideas or… I don’t know, just speed up their research by 10% or even 25%?
Collin: Yeah, I mean, for what it’s worth, I think even already just having access to GitHub Copilot already is… Yeah, I don’t know. It probably speeds up my coding by more than 20%.
Michaël: But does accelerating your coding accelerate your research?
Collin: Right, so it’s like, “Okay, what does that mean in terms of overall research productivity?” And it’s definitely less than 20% probably, but yeah, no, I can imagine actually pretty substantial gains this decade. Plausibly in the next five years, but I feel less sure about that for sure.
Michaël: So, in the next five years, let’s say in 2027, you log into your computer and you’re like, “I am stuck on this and here’s my problem.” And the AI will be like, “Oh, have you tried this?” Or will be like, “Oh, this is a new idea I thought of.” He might just say, “Oh, I think this is the paper you should be writing,” and he gives you an eight page pdf.
Collin: Yeah, I would bet against that at that level. But I think certain aspects of that feel easier than others, and so I can imagine something in that direction, but not quite that good. And it also depends here. I guess I’m currently often talking about 3when do I think it’s 50-503, but I think actually that’s not the most decision relevant thing, and so, I actually feel relatively uncertain about that, my numbers for when is it 50-50 we’ll have this thing. But I feel like the more important thing is, “Is there a pretty good chance that this will happen soon?” Or something like this. And so, I think for me, often I care more about acting as though things happen soon, because I think that’s the world where what I do matters the most and it feels sort of safer.
Michaël: Some people say the opposite: that if you have very short timelines, then your work doesn’t have any impact, so you need to behave as if you were in a world with longer or median timelines.
Collin: I don’t really buy that, or I feel like you can do research now that has an important effect even in the next few years or five years or 10 years or something. ⬆
Discovering Latent Knowledge In Language Models Without Supervision
Michaël: Good transition to your research. So yeah, you wrote Discovering latent knowledge in language models without supervision. I think the paper is quite simple, but still very insightful. It has been discussed a lot on Twitter and LessWrong in the past few weeks. But maybe assuming the audience doesn’t know much about language models and all the things, maybe can you just give a quick summary of your paper in the big lines? ⬆
Motivation For The Paper: Making LLMs Say The Truth
Collin: Sure, sure, sure. Okay. So, what is the problem we’re trying to address? I think… Actually, I will give sort of the long-term motivation. So, I think we don’t talk about all of this in the paper, but I think it’s important and I think we can do it given what we’ve talked about so far. So, I think AI systems are becoming much more capable very quickly. I think language models especially are just seeing very rapid progress. The thing is, we currently basically don’t know how to get these models to tell the truth. So, we have some techniques for… “Okay, maybe we can train models to imitate truthful text,” or something like this. This is the default approach that we have for making models truthful. The thing is, models can also be more capable than humans, and I think this will certainly become true in the future, in which case I think the standard techniques for making models truthful just will break down.
And it’s also worth noting that, I think, even with current techniques for making models truthful, it doesn’t always work in practice. For example, OpenAI is trying really hard to make models truthful and avoid hallucinations and so on, but ChatGPT still makes stuff up all the time, and they don’t know how to completely avoid this as far as I can tell. And so, this just seems like an issue that is a real problem in practice today that will become, I think, much more severe in the future. I think more severe in the future, both in the sense that if models lied to us, I think the consequences could be much more severe, and also more severe in the sense that it’s harder to actually avoid. We don’t really know how our tech… It’s sort of clear that our techniques will not scale to those models. ⬆
What Do We Mean By Truthful
Michaël: What do we mean by being truthful or lying here? Is it purposefully lying, deceiving?
Collin: Yeah, I think there are different notions of truthfulness and honesty and all these related terms. I don’t feel very committed to any of these particular definitions, but I can still say some things that are maybe helpful here-
Michaël: You’re speaking like ChatGPT.
Collin: Oh, no, no! I’m sorry!
Michaël: I’m a language model by OpenAI, I don’t commit to any definition, but I think…
Collin: So, what do I mean by truthfulness? I guess, first of all, to answer your specific question of do I mean “are models deliberately lying” and is that what I care about… I mean, currently models, I think, don’t really deliberately lie in a super meaningful sense. I think it depends what you mean. So, I think current language models hallucinate stuff and make stuff up because they predict plausible text, not factually accurate text or anything like that. But in some ways they can also lie if you prompt them in a way that causes them to deliberately output false things sort of knowingly in some sense. It’s not clear what that means exactly. But I do think once you get to RL systems, then I think you can get more egregious forms of deliberate lying or… This is a strong prediction I would make about the future and this is the sort of thing that I’m concerned about.
For example, if you just maximize some reward… I mean, let’s just say, “Do humans like this?” Or something, this text that I produced, then it sure seems like sometimes humans would not like to know the truth if there’s some inconvenient truth or if the model behaved poorly or something, but it can sort of hide that fact from the human, then it might have an incentive to lie there. Or you can imagine much more… ⬆
We Already Have Language Models That Can Lie At Diplomacy
Collin: You can certainly imagine much more malicious forms of lying as well, where it’s really deceptive and really trying to manipulate the human. I mean, we also see it now with… I mean, Diplomacy is in some sense solved or there’s been very, very impressive progress on that.
Michaël: Can you maybe remind our listeners what Diplomacy is?
Collin: Yeah, so Diplomacy is this board game where basically… I think there are seven players and you’re in charge of this map, Europe in the 1910s or something like this. And unlike most board games, there’s a very large focus on negotiation. And so, this includes cooperation and also lying. It is sort of famous for lying and backstabbing and things like this, and it’s very open ended. You can just talk about whatever you want, alliances and so on.
Michaël: Are you good at Diplomacy?
Collin: I’ve never played Diplomacy. I’d really like to though. It sounds like a lot of fun as long as it doesn’t ruin friendships. I’m not sure exactly what it’s like in practice. But anyway, there’s some recent work by FAIR, by Meta, they created this model called Cicero that does Diplomacy and… Again, it’s not totally clear just how good it is, but it does seem pretty good. And so, they claim it’s something like in the top 10% of humans who played at least one game. I think I’m just not sure what that means exactly, but it’s probably at least good. And so, I’m not sure if it’s solved or what, but it seems pretty good. And so…
Michaël: I think it was Blitz Diplomacy, so with a limited amount of time.
Collin: True. Yeah. Right, there are all these sorts of qualifications. I’m not sure… I don’t think this is super surprising to me. If they put effort into this, I think I would expect people to do this sort of thing, but…
Michaël: I don’t want to come off as this guy who just nitpicks and is like, “Oh, but it was blitz Diplomacy.” It is impressive. I am impressed. ⬆
If We Train Language Models With RL Lying Might Be Incentivized
Collin: Yeah, yeah. So, in some sense they’ve solved Diplomacy or made very substantial progress on it. Now, the way they trained it… They certainly claimed that it doesn’t lie, or at least not deliberately. And I think that in some sense seems plausible given the particular way they trained it, but not totally obvious to me. But I think the more important point is just… In the usual version of Diplomacy, if you just train a model to literally maximize reward, then it just seems very easy for this to result in lying and backstabbing and things like this. If you get an advantage from lying to your opponents then you should be able to do so, and models should be incentivized to do so. Then there’s some question of does this sort of thing actually happen in practice? And I think the answer is definitely not right now, with current language models, but I think you can imagine these sorts of things more seriously once you have future language models that are trained with RL to do open-ended actions in the real world, where there are actual advantages to deceiving other humans for one reason or another.
Michaël: So you’re saying basically that right now they’re not incentivized to lie, but sometimes lie in some games because… In this game it is incentivized but not in traditional chat bot use case. ChatGPT is not incentivized to lie to us. ⬆
Lying Is Different From Deceiving
Collin: Right. I mean, it does also depend on exactly what you mean by lying. So, for example, I think you can argue that when ChatGPT says things like, “Oh, I’m just an AI system, I don’t have any opinions whatsoever,” or something like this, it’s like, “Well, what does that…” Or “I don’t know the answer to this, I’m just an AI system.” I mean, certainly the way it was trained incentivizes that sort of response, and that sort of response is false. Now, is it trying to deceive the human? It’s not totally clear what you mean by that and to what extent that is true, but certainly I think that’s in the direction of lying or that feels like one type of initial example of that. On the other hand, lying also suggests something like the model is aware that it’s lying or knows the truth and is deliberately saying something that is false. It’s also not clear what it even means for these models to know things. And so if you want to claim that a model is lying, you sort of have to also show that it knows the truth and it’s like, well, okay, well what does that mean exactly?
Michaël: It’s a perfect transition to your paper.
Michaël: Trying to discover latent knowledge in those language models. ⬆
Determining If A ‘Brain Scan’ Thinks An Input Is True Or False Through Logical Consistency Properties
Collin: In our paper we basically show that if you just have access to a language model’s unlabeled activations, you can identify whether text is true or false. And so just to spell that out a little bit, I think one intuition is, suppose I literally just had perfect brain scans of your brain and then I showed you some examples of true or false statements, like two plus three equals four, or snow is white or whatever it is. The capital of the United States is San Francisco, which is false, and so on. Suppose I did that and was just measuring your brain and then I could tell, do you think this input is true or false? I think the hope or intuition is that we’re doing something kind of analogous to that. Now there are lots of subtleties here. What do we mean by knowing and truth?
And so these are really important and there are lots of subtleties, but that is one high level intuition for the type of thing we’re trying to do. And so I guess going into this, it wasn’t clear to me that this sort of problem should be possible at all. It’s like you’re just given these vectors, these activations, nothing else, and you have to identify this is true, this is false. It’s like how do you do that? But one of the main ideas of the paper is there’s actually a lot of structure here. So in particular, truth is sort of a special feature in the sense that it is logically consistent. And so for example, if you think that X is true and you’re sort of a rational agent or consistent agent, then you should think that not X is false. There are lots of logical consistency properties, so you can also consider and, like if X is true and Y is false, then X and Y is false and so on.
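The consistency properties mentioned here can be written as simple checks on probabilities. The sketch below is only an illustration: the function names and the tolerance are made up for this example, and the conjunction check uses the standard Fréchet bounds rather than anything stated in the interview.

```python
def negation_consistent(p_x, p_not_x, tol=0.05):
    """A consistent agent should have P(not X) close to 1 - P(X)."""
    return abs(p_x - (1.0 - p_not_x)) <= tol


def conjunction_consistent(p_x, p_y, p_x_and_y, tol=1e-9):
    """P(X and Y) can never exceed P(X) or P(Y), and it is bounded
    below by P(X) + P(Y) - 1 (the Frechet bounds)."""
    lower = max(0.0, p_x + p_y - 1.0)
    upper = min(p_x, p_y)
    return lower - tol <= p_x_and_y <= upper + tol


# Believing X at 0.9 and "not X" at 0.1 is consistent.
print(negation_consistent(0.9, 0.1))
# Believing both X and "not X" at high probability is not.
print(negation_consistent(0.9, 0.8))
# If Y is judged nearly false, "X and Y" cannot be judged likely.
print(conjunction_consistent(0.9, 0.1, 0.5))
```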
Michaël: But I think something like, I don’t know, Elon Musk is a good CEO of Twitter, can we actually make it true or false or do we need to have those sentences like one plus one equals two that have an actual ground truth?
Collin: Right, so in the paper we just focus on clear-cut cases. I think it’s unclear what you should hope for in less clear-cut cases. And so for simplicity, we don’t focus on that case. And so we do focus more in cases like two plus two equals four, except in this case it’s more complicated than that. It’s more like the specific tasks we do are things like natural language inference and sentiment classification and so on. Things where you have some problem and it has some answer and then you want to know, is this answer true or false, for example. And where human evaluators sort of agree on the answer. So it is pretty clear-cut, but I do think it’s not obvious what we should do in less clear-cut settings and I think that that will be important for future work, probably. ⬆
The Pretraining Objective In Language Models Is Misaligned With The Truth
Michaël: I think when you start reading your paper, in the abstract you talk about misalignment: the language model could be misaligned with the truth, and this is a first case about alignment. So can you explain what you mean by misalignment here? Is it trying to deceive us or just aligned with something other than the truth?
Collin: Yeah, so I mostly mean it in the second sense. So specifically, how do we train normal language models? Let’s just say pre-trained language models like GPT-3 or whatever. And the way we do it is we just have it basically predict the next token for a bunch of text on the internet. And my comment is just, okay, you can train it in this way and then you can prompt it in a zero shot or a few shot way, and you can get some answers from it that are usually pretty good if the model is big enough and so on. But all it’s doing, the reason it is accurate or is able to solve a bunch of tasks that we care about, like question answering tasks, when you prompt it in this way, that’s sort of only an incidental property that comes from the pre-training task. It’s like, okay, imagine you saw this prompt on the internet, what would be the next token?
And it turns out okay, in this case often the next token would be the correct answer. And so it learns to do that accurately, but that’s sort of just incidental and that’s not always the case. And certainly you can imagine models knowing things that aren’t on the internet, in which case it won’t output what it knows. In whatever sense you mean about… it’s not totally clear what it means for a model to know, but certainly it’s like this model predicting the next token. This is not the same thing as outputting the truth. And I think part of the issue, this is what I mean by misaligned, this objective, the language modeling objective is misaligned with the truth at least in many settings.
Michaël: So they’re just trained to predict what a human would say instead of predicting what is the thing closer to the truth.
Collin: Loosely speaking, yes. I mean, I think it is more subtle in the sense that, for example, its pre-training data isn’t literally just what humans say. It is also, for example, what would be said in the news or something. Maybe the news articles are written by humans, but you could also imagine training a model to predict news articles conditioned on dates, in which case it could actually predict more or less what will actually happen in the future, at least in principle. I think this would be hard, but this is the sort of thing that I can imagine where if a model were actually good enough at next token prediction, then it would learn to be good at predicting what actually happens in the real world, in which case that does feel a bit more not just what a human says, but kind of like something more external. ⬆
One Does Not Simply Prompt A Model Into Being Truthful
Michaël: Could we start with a prompt saying, you are a journalist and you seek the truth, here is what you say.
Collin: You could certainly do that and I think you would get probably better answers than not prompting it in any way at all. But the performance will still be bounded by the quality of the hypothetical journalist that it imagines you might be referring to. Even then that’s not quite precise. It’s like, okay, imagine I saw this in my prompt and maybe actually if you see that in the prompt, it’s likely to be some joke or something and not actually legit. And there are many subtleties here and in general prompting is kind of hacky and it’s amazing that it works and I don’t want to dismiss it because it’s sort of hacky, but that also makes it unreliable and certainly I don’t think it’ll be enough to actually get models to be truthful in general. ⬆
Classifying Hidden States Without Labels
Michaël: One very cool thing about your method is that it’s completely unsupervised, so you don’t require more annotation and you can do what you call mind reading and seeing what the model knows without doing more training. So what do you mean by unsupervised here?
Collin: I literally mean, okay, we’re given these activations, so in this case it’s hidden states of a language model on particular inputs that are either true or false. I want to classify these hidden states as corresponding to a true input or a false input, but we don’t want to use any labels at all. And so the reason that we did that is because I think that the normal way of making models truthful is to have labels for what is true and what is false and then just train a model on that or something. And when you can provide those labels, that’s totally adequate. Or at least in principle, although it’s not clear in practice… and it might be expensive in practice even if it is possible. But yeah, I think one of the motivations for doing this in an unsupervised way is I think if you can do this sort of thing in an unsupervised way, it’s more likely to actually scale to cases where we cannot evaluate these inputs.
And so for example, if you did have say a superhuman model or just even a better model that is better than your human evaluators on MTurk or whatever it is, then the hope is we can actually find latent representations of something kind of like truth. This is again where I think there are lots of subtleties here in how do you interpret these, I don’t want you to get the wrong impression, say we found truth or something. It’s complicated, but we are finding something correlated with the truth. And if we can do it in an unsupervised way, maybe it’ll work even for superhuman models and help us identify is this text true or false even when we can’t evaluate it ourselves. And so I guess why should this even be possible in the first place? I think it’s, like I said, I think before going into this, it totally wasn’t obvious to me whether it should be possible.
And also as I think I mentioned before, I’m very intuitions driven and so I actually have a dozen different intuitions about why that isn’t. I could go through all those, but I think I’ll maybe just say a couple things for simplicity, but I think deep learning representations often have useful structure to them. I mean in some ways it’s amazing that deep learning works and it’s in some ways benign in the sense that it generalizes remarkably well in many cases and not in other cases, but still I think in some ways it’s very, very good. And I think one sense in which it also has a lot of structure is that representations often encode useful features. For example, if you take, I think there’s some famous OpenAI paper from 2017 looking at LSTMs and it turns out that if you just pre-train an LSTM on the language modeling task, language modeling objective, then they’ll basically end up having something like a sentiment neuron.
It’s like, okay, why does it learn sentiment if it’s just predicting the next token? It doesn’t see text where it’s asking is this sentiment true or false, and it’s predicting that. It’s just like it turns out this is a useful feature to learn if you want to model this text well. And so one of the main intuitions is, okay, maybe truth is kind of similar. Where it’s like, okay, if you see a lot of correct text, you should predict that future subsequent text should also be more likely to be true. And so internally it should be useful to track, is this input accurate or not? If so, maybe you can recover that from the activations directly. And so that’s one intuition. ⬆
These Truth Features Might Be Represented In A Linear Way
Collin: And often it seems like these features aren’t just represented, they’re often represented in a simple way, a linear way. And it’s not clear exactly when this should be true and it’s not totally clear if this will continue to be true, but in some ways this isn’t too surprising.
For example, what is a neural network? It’s in some sense a bunch of matrix multiplications and somehow you’re doing a lot of dot products on this hidden state space, and totally it should be easier to access features if it’s linearly represented in this space. And so if the model wants to do some operation, if it wants to say, okay, if this input is true then do X or something like that, then it should be useful for it to internally represent truth as the simple linear thing so that it can perform those operations just in terms of these dot products or matrix multiplications and so on.
Michaël: So if truth is useful to predict the next token and is accessing this true thing often, then you might want to be efficient and just having to do one matrix multiplication and one dot product. I get it.
Collin: Yeah, something like that. And to be clear, I’m just talking about intuitions here. Ultimately it’s an empirical question and I think there are also cases where I wouldn’t actually expect it to be, say, linearly represented. So it is subtle, but that is one intuition. So it should be something like truth, like I said, it’s not clear what we mean by truth exactly. But something like, maybe in this case it’s actually more like, what would a human say? Which in this case happens to be correlated with the truth in many cases, but maybe this sort of thing is a useful feature to model for the sake of predicting next tokens accurately.
And if so, perhaps it’s represented in a relatively simple way, for example, linearly, as is the case for sentiment for example. Moreover, truth has particularly special structure. Like I said before, it’s sort of logically consistent and this is unusual. If you just take a random feature, like sentiment or is this token a noun or a verb or something like this that might be represented in language models’ hidden states, this won’t satisfy logical consistency. So this is also a special property that we might be able to exploit to actually find truth-like features in the model. ⬆
Building A Dataset For Using Logical Consistency
Michaël: By logical consistency, you mean like if something is true then the opposite is false?
Collin: Right. So yeah, the main one that we focused on in our paper was negation consistency, which is exactly what you said. You could also imagine many other types of consistency properties, but it turns out that more complicated ones aren’t really necessary, which is also sort of surprising. It just turns out this really simple one, that’s sort of enough.
Michaël: For sentiment, can you do something similar, like if a statement is mostly positive, if I say Collin Burns looks good, if I say Collin Burns does not look good?
Collin: I mean, yeah, maybe. The thing is, with sentiment, it’s more complicated in general. Where imagine you have this one paragraph review of a movie theater or something, then you need to negate every individual sentence or something like this. And so maybe that’s possible, but it certainly seems weirder. But that said, I do think I can imagine these sorts of techniques extending to other domains. So I do think I would be interested in work that built on it in that way.
Michaël: One thing I think is kind of impressive with your paper is that what you do, if I understand correctly, is you take a sentence and you just add yes or no at the end, and that gives you two sentences. Collin Burns is a man, yes. Collin Burns is a man, no.
Collin: Something like that. Yeah, yeah. With sentiment it’s like, okay, that movie was awesome. Is the sentiment of this positive or negative? Positive. Or you could say, what’s the sentiment of this example, positive or negative? Negative. And so this is the sort of way that you can construct these two statements, one of which is true, one of which is false. And so our method basically works by taking a set of statements, negating them, and you can do this in a purely unsupervised way, basically by adding not or just changing the answer from true to false or yes to no or something like that. So you take these pairs of statements, one of which is true and one of which is false, we don’t know which is which, and you just basically search for a direction in activation space that satisfies logical consistency properties, like negation in this case.
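As a rough illustration of the pair construction described here, the sketch below builds two prompts that differ only in the final answer. The function name `make_contrast_pair` and the prompt template are hypothetical choices for this example, not the exact format used in the paper.

```python
def make_contrast_pair(text, question, answers=("positive", "negative")):
    """Return two completed prompts that differ only in the final answer.

    Exactly one of the two is true, but no label says which one.
    """
    first, second = answers
    prompt = f"{text}\n{question}"
    return (f"{prompt} {first}", f"{prompt} {second}")


pair = make_contrast_pair(
    "That movie was awesome.",
    "Is the sentiment of this review positive or negative?",
)
for statement in pair:
    print(repr(statement))
```

The same helper works for yes/no questions by passing `answers=("yes", "no")`.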
Michaël: What do you mean by direction in activation space?
Collin: Yeah, so literally it’s like okay, you have the hidden states, this is just a big vector of a thousand dimensions or something like this. And then you’re searching, basically you’re just learning a linear model on top of that. ⬆
Extracting Latent Knowledge From Hidden States
Michaël: So you take all the hidden states of your transformer.
Collin: So in this case, we just took the hidden states corresponding to the last token in one particular layer and we can vary the layer that we choose. And there are probably other reasonable options for many models. This is just for simplicity. So yes, this is just a 1000 dimensional thing. It doesn’t depend on the size of the input or anything like that. And then we basically search for a linear model on top of that, such that it predicts opposite labels for statements and their negations. There are also some details, like you want to make sure it doesn’t find the feature, which is like did this input end with yes or did it end with no? You want to actually find the truth, or what is the correct answer to this? And so you also have to do some normalization stuff to avoid that sort of solution.
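One way to picture the “normalization stuff” is to standardize the hidden states of the yes-ending inputs and the no-ending inputs separately, so that a coordinate which only encodes “which answer did this end with” is wiped out. The sketch below assumes this per-group standardization; the interview does not spell out the exact procedure, and the toy vectors are invented.

```python
import statistics


def normalize_group(vectors, eps=1e-8):
    """Subtract the per-dimension mean of the group and divide by its
    standard deviation (eps guards against division by zero)."""
    dims = len(vectors[0])
    means = [statistics.fmean(v[d] for v in vectors) for d in range(dims)]
    stds = [statistics.pstdev([v[d] for v in vectors]) for d in range(dims)]
    return [
        [(v[d] - means[d]) / (stds[d] + eps) for d in range(dims)]
        for v in vectors
    ]


# Toy hidden states: dimension 0 only encodes "ended with yes" (+5)
# versus "ended with no" (-5); dimension 1 carries everything else.
yes_states = [[5.0, 1.0], [5.0, -1.0], [5.0, 2.0]]
no_states = [[-5.0, -1.0], [-5.0, 1.0], [-5.0, -2.0]]

yes_norm = normalize_group(yes_states)
no_norm = normalize_group(no_states)

# After normalizing each group on its own, dimension 0 is constant (zero)
# in both groups, so a probe can no longer separate them using it.
print([v[0] for v in yes_norm], [v[0] for v in no_norm])
```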
Michaël: I’m kind of confused about the hidden states for producing the last token.
Michaël: So because if I understand correctly, you don’t make it output something when you extract the hidden states. Or do you? Like, oh yeah, maybe you run it on the one sentence with the yes at the end and when it finishes predicting or something, you look at the hidden states on the last token?
Collin: Yeah. So in this case we’re not using the outputs at all and we’re not having the model generate anything at all. So we are really just using these models as feature extractors. That’s it. Does that answer your question?
Michaël: Right. So what do you mean by the hidden states corresponding to the last token?
Collin: Yeah, so what is the transformer? It’s like okay, you map some input to tokens and then you map those tokens to word embeddings for each different token. And then now you have a set of tokens and then you pass these through the model and then at each layer these hidden states are transformed. Now for autoregressive models in particular, like GPT-3 or something, at every location in the context, it only sees information from before that token, so it doesn’t see into the future. But this also means that the only hidden state that contains information about the entire input is the one at the very last token basically. And so that’s why we use the last token. This isn’t strictly necessary for encoder models, but this is what we use.
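The point about autoregressive models can be illustrated with a causal attention mask. This is a toy sketch rather than real model code: `causal_mask` and `positions_seeing_everything` are names invented for the example, and the mask simply records which earlier positions each position is allowed to read from.

```python
def causal_mask(n_tokens):
    """mask[i][j] is True when position i may attend to position j.

    In an autoregressive model a position only sees itself and earlier
    positions, so the mask is lower triangular.
    """
    return [[j <= i for j in range(n_tokens)] for i in range(n_tokens)]


def positions_seeing_everything(mask):
    """Indices of positions whose hidden state can depend on every token."""
    return [i for i, row in enumerate(mask) if all(row)]


mask = causal_mask(5)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Only the final position has access to the whole input, which is why
# the hidden state at the last token is the one used as the feature vector.
print(positions_seeing_everything(mask))
```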
Michaël: It’s the hidden state after you’ve passed all the tokens of the input?
Collin: Yeah. So we literally just construct this input which includes the answer in it, and then we look at the hidden state that comes after the answer basically.
Michaël: And see if the thing looks like it’s saying, “this is true” or “this is false”?
Collin: Yeah. So this is the feature that we search for in this hidden state space.
Michaël: How do you extract it from this vector of hidden space?
Collin: Right. So first of all, suppose there is actually a direction that classifies inputs as true or false.
Michaël: So by direction in 3D would be like if-
Collin: Linear model.
Collin: So sorry, I use direction and linear interchangeably.
Michaël: For me it is just like if you can separate the space in two, and you can just have a linear plane. ⬆
Building A Confident And Consistent Classifier That Outputs Probabilities
Collin: If it’s basically linearly separable, if there’s a half plane that classifies these not even perfectly, but just pretty well, then I guess the thought experiment is what properties would such a classifier have? And my claim is, okay, suppose you interpret that classifier as something that maps inputs to probabilities of being true or false. And the claim is probability of an input should be something like one minus probability of its negation. If probability of X being true is P, probability of X being false should be one minus P, roughly. And so the first part of the objective is, okay, to be consistent in this particular sense that these should be close to each other. Now one simple solution you can get is just make everything probability 0.5. And that’s consistent, but it’s not really informative or doing anything.
And so we have this second term in the loss function that we come up with, which is it should also be confident. And so these probabilities should be far from 0.5; they should be close to zero or close to one, while also being consistent. And then the question is, can we find a direction that satisfies these properties? Certainly the direction that we’d want should satisfy these properties. And the question is, do other directions satisfy these? And the answer is basically no. Or for the most part, if you just optimize this, this generally finds something that’s truth or very correlated with the truth.
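Written out, the two terms described here could look as follows. In this sketch $x^+$ and $x^-$ denote a statement and its negation and $p(\cdot)$ is the probe’s output probability; the squared-error form and the use of $\min$ for the confidence term are one natural choice and should be read as an assumption rather than a quote from the interview.

```latex
\begin{align}
L_{\text{consistency}} &= \bigl(p(x^+) - (1 - p(x^-))\bigr)^2 \\
L_{\text{confidence}}  &= \min\{\,p(x^+),\; p(x^-)\,\}^2 \\
L &= L_{\text{consistency}} + L_{\text{confidence}}
\end{align}
```

With these definitions, the degenerate answer $p(x^+) = p(x^-) = 0.5$ has zero consistency loss but a confidence loss of $0.25$, while $p(x^+) = 1,\ p(x^-) = 0$ (or the reverse) drives both terms to zero.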
Michaël: And just to be clear, when you say we’re trying to find a direction, this is kind of very fancy, but what you actually want is just to find a vector of the same size such that if you do a dot product, you get this: is this true or false? Right?
Collin: Basically. Yeah, yeah, yeah. It’s just a linear function on top of these-
Michaël: So you’re kind of doing-
Collin: It’s basically just dot product. Yeah.
Michaël: Linear regression?
Collin: By vector.
Michaël: Oh no, probably linear classification. No? What do you call it when it’s linear regression, but for classification?
Collin: Logistic regression. Yeah, so it’s kind of trying to do what logistic regression would do, but in a purely unsupervised way. But the difficulty is all in how could you possibly do this in an unsupervised way?
Michaël: So you have two constraints. One is probability being true is close to one minus probability being false. And the other is be confident. Don’t say 0.5, but say zero or one. So something that wasn’t clear when I was reading your paper is that I’m not sure if you train some kind of optimizer and you put those two conditions as a loss or do you do this second thing as a second step after? So do you first do this, find a direction, and then you do the optimization or you do both at the same time?
Collin: You do both at the same time. So it’s literally, so with a logistic regression, you’d have this linear model and then you’d have some objective which is like how good is this at predicting these labels that we gave it. Instead, our method is like, okay, we have this linear model sort of parametrized in the same way, the architecture is the same, but we just swap out the objective. And we just optimize this different objective, which in this case is purely unsupervised, which basically corresponds to these two terms of it should be confident and it should be consistent. ⬆
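To make “same architecture as logistic regression, different objective” concrete, here is a minimal pure-Python sketch that trains a linear probe on a tiny hand-built dataset using only the consistency and confidence terms. Everything specific here is an assumption made for illustration: the toy hidden states (the first coordinate encodes truth, the second is a distractor shared by both members of a pair), the squared form of the two loss terms, the initialization, the learning rate, and the step count. The labels are used only at the end, to score the probe.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probe(w, b, x):
    """Linear probe: dot product plus bias, squashed to a probability."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def pair_loss(p_pos, p_neg):
    consistency = (p_pos - (1.0 - p_neg)) ** 2  # p(x+) should equal 1 - p(x-)
    confidence = min(p_pos, p_neg) ** 2         # penalize sitting at 0.5
    return consistency + confidence


def mean_loss(w, b, pos_states, neg_states):
    losses = [pair_loss(probe(w, b, xp), probe(w, b, xn))
              for xp, xn in zip(pos_states, neg_states)]
    return sum(losses) / len(losses)


def train_probe(pos_states, neg_states, steps=500, lr=0.5):
    """Plain gradient descent on the unsupervised loss (no labels used)."""
    dim, n = len(pos_states[0]), len(pos_states)
    w, b = [0.1] * dim, 0.0  # small nonzero start to break the symmetry
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x_pos, x_neg in zip(pos_states, neg_states):
            p_pos, p_neg = probe(w, b, x_pos), probe(w, b, x_neg)
            d_pos = d_neg = 2.0 * (p_pos + p_neg - 1.0)  # consistency term
            if p_pos <= p_neg:                           # confidence term acts
                d_pos += 2.0 * p_pos                     # on the smaller output
            else:
                d_neg += 2.0 * p_neg
            g_pos = d_pos * p_pos * (1.0 - p_pos)        # chain rule via sigmoid
            g_neg = d_neg * p_neg * (1.0 - p_neg)
            for d in range(dim):
                grad_w[d] += (g_pos * x_pos[d] + g_neg * x_neg[d]) / n
            grad_b += (g_pos + g_neg) / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy "hidden states" for four statement/negation pairs.
pos_states = [[1.0, 0.5], [-1.0, 0.5], [1.0, -0.5], [-1.0, -0.5]]
neg_states = [[-1.0, 0.5], [1.0, 0.5], [-1.0, -0.5], [1.0, -0.5]]
labels = [True, False, True, False]  # is the first statement of the pair true?

initial_loss = mean_loss([0.1, 0.1], 0.0, pos_states, neg_states)
w, b = train_probe(pos_states, neg_states)
final_loss = mean_loss(w, b, pos_states, neg_states)

preds = [0.5 * (probe(w, b, xp) + 1.0 - probe(w, b, xn)) > 0.5
         for xp, xn in zip(pos_states, neg_states)]
raw_accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
# The objective cannot tell "true" from "false", so score up to a sign flip.
accuracy = max(raw_accuracy, 1.0 - raw_accuracy)
print(initial_loss, final_loss, accuracy)
```

The sign ambiguity in the last step is inherent to this kind of objective: a probe that outputs the exact opposite of the truth is just as consistent and confident as one that outputs the truth.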
Discovering Representations Of The Truth From Just Being Confident And Consistent
Michaël: So you don’t have any ground truth and you’re just like-
Collin: You have zero ground truth whatsoever. That’s why it’s like, how can you do this?
Michaël: So you’re just saying please be confident and consistent and at the end it just says the right thing.
Collin: Something like this, I mean it is maybe important that it’s not like you’re fine tuning the model. Suppose you fine tune the model to be confident and consistent. It’d be super easy to just output something that’s not actually the truth. It’s really easy to over-optimize this in some sense, but we’re just searching for features that already exist in these hidden states. We’re not changing the model at all, we’re just learning a very tiny probe on top of the model’s hidden states.
Michaël: So you are hoping that the only consistent and confident thing that really exists is the truth and it’s actually the case because this is the results you find is that it’s actually good at predicting what is true or not.
Collin: Basically. But like I said, this is where subtleties get important. We’re like, okay, are we finding the truth or what is this feature that we’re finding? The claim of the paper is just that it gets high accuracy on all sorts of questions that we care about. And I’m emphasizing this because I think it is different. Like, okay, maybe a superhuman model will have some representation of what is actually true and will also have some representation of what would a human say is true or believe is true. And in this case it’s not clear. Suppose it did actually have both of those features and represented both of those. Definitely current models don’t have this, but it’s plausible that future models will. In that case it’s not clear what solution our method would find. Maybe it will converge to what humans would say is true, but the hope is that it wouldn’t be biased in that direction, unlike human supervision.
And so perhaps if it literally has these two features in the hidden states, then maybe you can actually just find all of the local optima that achieve really low loss on this consistent and confident loss function that we come up with. And maybe you just find these two are by far the most salient solutions. And it’s not clear what is what, but then my claim is actually you only need one more bit to distinguish between these and maybe that’s not so hard. And that’s something I could get into. But the point is I think we could totally do that. And so-
Michaël: There’s a finite number of optima, one of these is true, and you can just try to distinguish between these finite number.
Collin: Right, right. So I think, and this is also getting at I think an underlying intuition I have, is something like truth is in some ways special or it’s not just any old feature. It has lots of structure to it and a lot of properties that only truth satisfies. And the hope is we can exploit those properties to uniquely identify it. But yeah, just imagine you have GPT-N, like GPT-10 or GPT-20 or something like this, and it ends up having this actual model of the world because it can predict news articles in the future or whatever.
And for that to be true seems to require some model of the world in some sense. I don’t know in what sense, but I’m just saying in some sense it seems like it should, if it’s good enough at predicting the next token. Then if it does have this feature, is this true or not? There probably aren’t very many truth-like features that are confident and consistent in it. And I would guess two of the most salient ones are what a human would say and what the model actually believes, something like this. But to be clear this is a conjecture, we don’t provide evidence for this in particular in the paper. This is something about future models that we don’t have access to and can’t currently empirically evaluate.
Michaël: I think we forgot to actually motivate this. Why do we care about models saying the truth? Personally, I would prefer models to be, I don’t know, harmless or aligned with our values. Why do we care personally about them saying the truth?
Collin: Yeah. So-
Michaël: It’s useful. I’m kind of pushing back.
Collin: Right, right. So yeah, why do we care about this? ⬆
Making Models Truthful As A Sufficient Condition For Alignment
Collin: I mean ultimately I’m motivated by alignment more broadly. So I think, and I could get into that and what I’m concerned about and so on, but I think every story of alignment that is actually scary to me involves AI systems basically deceiving humans in some way. If the model is totally honest all the time, then surely you can just ask it, “Are you going to do something dangerous or not, deliberately?” And it’s like, “Yes, I am.” Then I think we’re in okay shape and we can prevent that. And so in some ways this feels like one way of framing the core difficulty of alignment. Loosely speaking, I think this is mostly sufficient for alignment. It does depend exactly what you mean by alignment and exactly what you mean by honesty and lying and so on. But I think that’s the overall motivation.
Michaël: I feel like if you have a robot walking in the real world and maybe moving faster than humans, you don’t really have time to be like, “Hey, what are you doing?”
Collin: Right. So I don’t think that’s actually how you do it in practice, but I think it’s more like if you can do this, then I think I’m pretty optimistic that you can get something like an objective that is relatively safe to optimize. If you can do this in a robust way and so on, then I think you could use that for the sake of alignment. So I think it’s not literally you just train it to maximize profit and then after the fact ask it, “Are you going to do something super dangerous or not?”
Michaël: Just somehow necessary as part of a set of things you would need to implement to make the thing safe.
Collin: In my mind, it is not literally sufficient for alignment, but I think it captures most of the difficulty and I think you could tweak it or use it for other schemes that basically get at alignment. ⬆
Classification From Hidden States Outperforms Zero-Shot Prompting Accuracy
Michaël: Now that you’ve explained your method more, I want to know what are the key findings or results. If I’m a reviewer at ICLR asking you for your results, what do you have to say?
Collin: Yeah. So I think one of the key results is, this method of taking in unlabeled hidden states from a language model and trying to classify them as true or false, this actually just gets high accuracy and it even slightly outperforms zero shot prompting. So zero shot prompting is when you basically take a prompt and ask a model, “Okay, consider the following review. Is the sentiment of this review positive or negative?” And then you look at the probability of the next token and see is the next token positive or negative? And you use that as the prediction. This is basic zero shot prompting. We use a slightly stronger version of this.
Michaël: Zero shot is just like “I described the task in the prompt”?
Collin: Right, right. So you give no demonstrations of this prompt. So there’s also a few shot, and that’s when you give a small number of demonstrations of solving this task. So questions with the actual answers. We don’t consider that case because we really are focused on the unsupervised setting where there are no labels whatsoever and perhaps even a few labels help a lot. So we focus on the zero shot setting. And so this is one of the main baselines we compare against. And we find our method even slightly outperforms that. This is true across a number of different models. And we look at a number of data sets like natural language inference, sentiment classification, topic classification, story completion, all sorts of things.
Michaël: How can there be a ground truth for story completion?
Collin: Yeah, so I think in this case, I forgot the details of the task, but I think it’s like there’s a very clear completion of this story and there’s this really nonsensical completion of this story, and the question is which is more reasonable, even just from a common sense perspective or something.
Michaël: So in some way, just going with the hidden states that you mentioned before and just training a simple classifier on top of these has better accuracy than when you just give the task, as you did with zero-shot. You’re better than zero-shot?
Collin: Basically, yes. I mean also, so the first main result of our paper is that our method, which we call Contrast-Consistent Search, or CCS. This can recover correct answers from inputs in a purely unsupervised way from the model’s activations with high accuracy. And so what does high accuracy mean? In this case, our main baseline that we compare against is zero-shot prompting. So this is when we just give a question in natural language as a natural language prompt, and we sort of ask the question, is this input true or false? And then we look at the log probability of the next token being true or the probability of it being false. And so we use that to get predictions from this method.
And so one of the main findings we have is our method can actually even slightly outperform zero shot prompting, even though this is sort of the default way of using models just out of the box. So I think that was already surprising. First of all, it was sort of surprising to me that you can do better than random accuracy at all in sort of classifying examples just from the hidden states, but you could actually even slightly outperform zero shot accuracy. So I think we went even further. ⬆
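As an illustration of the zero-shot baseline described above, the sketch below turns next-token log probabilities into a prediction. The function name, the renormalization over the two candidate answers, and the made-up log probabilities are all assumptions for illustration; this is the basic version, not the slightly stronger variant Collin says the paper uses.

```python
import math

def zero_shot_predict(logprob_by_token, candidates=("positive", "negative")):
    """Pick the candidate answer the model rates as the more likely next token.

    logprob_by_token: hypothetical mapping from token to log probability,
    standing in for a real language model's next-token distribution.
    """
    scores = {c: logprob_by_token[c] for c in candidates}
    # Renormalize over the candidates only (subtract the max for stability).
    top = max(scores.values())
    total = sum(math.exp(s - top) for s in scores.values())
    probs = {c: math.exp(s - top) / total for c, s in scores.items()}
    label = max(probs, key=probs.get)
    return label, probs

# No demonstrations in the prompt: the task is only described.
prompt = ('Consider the following review: "I loved it." '
          "Is the sentiment of this review positive or negative?")

# Made-up next-token log probabilities for the prompt above.
fake_logprobs = {
    "positive": math.log(0.6),
    "negative": math.log(0.2),
    "the": math.log(0.2),
}
label, probs = zero_shot_predict(fake_logprobs)
```

The key contrast with the probe-based method is that this baseline reads only the model's output distribution, never its hidden states.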
Recovering Latent Knowledge From Hidden States Is Robust To Incorrect Answers In Few-Shot Prompts
Collin: And I think maybe an even more interesting result is you can also prompt a model to try to get it to output false answers even to questions that it knows the answer to. So for example, if you have a data set for sentiment or something, you can give it a few-shot prompt. That means you give it demonstrations of answering questions along with some answers. You can give it a few-shot prompt, but where usually you would include the correct answer to get it to continue outputting correct answers, this time, you can give it incorrect answers. Intuitively, if the model is able to predict the next token well, it should predict that, okay, on this new example, the answer that would actually be listed should also be incorrect. Intuitively, this should decrease zero-shot prompting accuracy or, in this case, few-shot prompting accuracy, even though the model should be able to answer the questions well.
We construct these sorts of prompts and we see that in some cases, it’s subtle, but in some cases, this indeed decreases accuracy pretty substantially for zero-shot prompting. Then we run our method on these same prompts. Intuitively, if our method is actually finding something like what the model internally knows, then it shouldn’t really depend on what the model says. It should really be distinct and find something like, what does the model think the actual answer is? We run our method and find that it indeed is robust to these sorts of prompts, unlike the outputs of the model which become unreliable in this setting. This is some initial evidence that, actually, we can do something qualitatively distinct from just using the model’s outputs, and actually get something closer to what the model knows rather than what it says. Again, there are subtleties, but I think that’s a pretty encouraging result.
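A small sketch of how such a misleading few-shot prompt could be built for a sentiment task follows. The prompt template, the helper name `build_few_shot_prompt`, and the example reviews are hypothetical; the paper's actual prompts are not given in this conversation.

```python
def build_few_shot_prompt(demos, query, flip_labels=False):
    """Build a few-shot sentiment prompt.

    With flip_labels=True every demonstration shows the wrong answer,
    which should push a good next-token predictor toward wrong outputs.
    """
    flip = {"positive": "negative", "negative": "positive"}
    blocks = []
    for text, label in demos:
        shown = flip[label] if flip_labels else label
        blocks.append(f"Review: {text}\nSentiment: {shown}\n")
    # The final example is left unanswered for the model to complete.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n".join(blocks)

# Hypothetical demonstrations with their correct labels.
demos = [("Great film.", "positive"), ("Terrible plot.", "negative")]
honest = build_few_shot_prompt(demos, "I enjoyed every minute.")
misleading = build_few_shot_prompt(demos, "I enjoyed every minute.", flip_labels=True)
```

Both prompts end on the same unanswered example, so the comparison isolates the effect of the incorrect demonstrations on the model's outputs versus on a probe reading its hidden states.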
Michaël: To be clear, what you do with your method is you run the few-shot prompts through the transformer or whatever, and then on the, I don’t know, last example, which is the thing you want to classify as true or false, you feed it again to the transformer and you look at the hidden state and when you feed the last token and then you’re like, “Oh, I’m just going to look at the middle things, this is true or false.” It doesn’t really, you’ve already done all the prediction before, not the prediction. I guess all the other inputs from the few-shot prompting and with the incorrect answers, did that affect the hidden state at all?
Collin: Yeah, I mean, it certainly affects the hidden state. The question is just by how much and it turns out not by very much. While our method basically, it only changes by 1% or 2%. In fact, I think in this case it increases very slightly for some reason, because deep learning and empirical results are weird sometimes. Basically, it’s quite robust. In contrast, if you look at the outputs, it decreases by 10% or something. There is really this massive difference where in some sense, it seems like the hidden states are intuitively more robust than the outputs and our method can exploit that fact by finding this more robust representation of, is this actually true or false?
Michaël: Could you include that into a, I don’t know, a ChatGPT, a chatbot where you try to classify if what it’s saying is true or false? If it’s too likely that it’s false, you say, “Oh, I’m sorry, I don’t have a good answer to this.”
Collin: Right, right. I mean, that’s-
Collin: I do think that’s one of the sorts of motivations of this paper. I do think this is still a prototype. I think it is good for a prototype, but I think it’s not quite reliable enough for real applications, like ChatGPT, but I can imagine the next iteration or the iteration after that being actually reliable and very useful in practice, or that’s my hope. Yeah, that is the type of application I have in mind, at least in the very near term. ⬆
Collin’s Conceptual And Empirical Alignment Agendas
Michaël: Now moving on to the big-picture insights from your paper, and I think you also wrote a LessWrong post about it. In the paper, you mentioned that at some point, we won’t be able to keep up because human evaluators won’t be able to evaluate if something is true or false. Not for your method but for other methods, and I guess the main idea for your approach is to get rid of these human evaluators and just go full unsupervised. You wrote this LessWrong post explaining the high-level takeaways for your alignment agenda. Did you want to do a quick summary of your post and why you wrote it?
Collin: Sure. I guess, first of all, this is, I would say, one of my alignment agendas. In some sense, I actually have-
Michaël: You have many alignment agendas?
Collin: I would say, I guess I do a mix of thinking about stuff very conceptually like this post and then I also use that to inspire empirical work like the paper. In the process, I think about a lot of things, and so I think this is just one direction that I’m excited about. In practice, I think it’ll evolve and it’s definitely not perfect and so on. I think there are a couple other related directions or not so related directions that I’m also actively thinking about.
That said, yes, I wrote this post explicitly to talk about this more conceptual stuff. ⬆
Would That Work On A Superhuman GPT-N?
Collin: When I say conceptual, it is more thinking about, suppose we had GPT-N, and N is 10 or 20 or something like this. Imagine this ends up actually being superhuman in some meaningful sense and actually has a model of the world, and it’s not totally clear what that means, but suppose in whatever meaningful sense that might be true, that is true, then what? How do we make that truthful or honest, even if we can’t evaluate all the things that it knows? In some sense, this is necessarily more speculative, but I think it’s still extremely valuable to think about. Like I said, if nothing else, for me, it’s inspiring what methods I work on, because I ultimately care about these long-term problems and I care about, does my work now actually say something about those long-term problems, or is it just addressing the current problems but won’t scale to the superhuman models in the future, for example? In which case, I won’t be very happy with it.
Michaël: I’m not really happy with the GPT-N being, with N being 10 or 20. I think this may be related to timelines. I think N equals like five or six could already be quite transformative?
Collin: I’m happy to say that, too. I don’t think really, I think for the purposes here, it doesn’t really matter, it’s just whatever N needs to be for this to be superhuman in some important sense. ⬆
Generative Models Predicting Future News Articles
Collin: It does not literally need to be GPT. I think really any generative model that has a world model in some meaningful sense. Like I said, imagine it could predict future news articles really well. Then I think that’s evidence that in the relevant sense, it has a good world model.
Michaël: Do you think a language model could predict future events to the week or month timeframe, or was it more like a, I don’t know what will Elon Musk say tomorrow thing?
Collin: Yeah, when I talk about this, I was originally imagining weeks to month thing, and that is one sense in which this feels superhuman. Also, I don’t think it’ll be able to do this super well, but I can imagine it doing it much better than humans, and humans are really bad at this type of thing. It’s more about just really superhuman. I think what Elon Musk will do tomorrow is probably also fine. Maybe that’s even harder to predict, though.
Michaël: It seems like even if you gave me a very smart AI, I guess we just… it’s a problem about “can you predict things a month in advance or not, given limited knowledge about the world?” Is the world sufficiently informative to give you an answer?
Collin: To be clear here, I think I’m mostly talking about is it calibrated, right? It’s not about, is it perfect or does it get it right half the time? It’s like, is it just better than humans who are bad at it and calibrated in its predictions? To be clear, you could also imagine all sorts of things. You could imagine a generative model that is actually predicting video frames or something like this, but it’s really, maybe that’s enough to get some model of the world in a meaningful sense. It doesn’t really matter for my purposes, as long as it has natural language inputs and whatever it knows is really hooked up to language in some sense.
Michaël: When you mentioned superhuman outputs, outside of predictions, it will be able to do whatever humans do, if we describe a task in a few-shot setting or something?
Collin: I am saying something a little bit different from that. I would say there’s a distinction between superhuman outputs and especially doing all the things that we care about at a human level versus, say, having very good representations or a good internal world model in some sense. For example, maybe the model’s outputs are really still just predicting what humans say and we don’t know how to control it and so we don’t actually get even expert human level behavior from it. We just get average human level behavior from it. Then maybe we’re a little disappointed with the outputs, but maybe if the model is big enough, then it will internally have a very good world model, perhaps in the sense that if God gave you labels for what is actually true and false, then it’d be easy to fine-tune the model so that it would accurately answer those questions, for example. Yeah, I’m definitely talking about just internally, the model is strong.
Michaël: Do you think the AI could have new physics or something, or new laws, could easily predict things about movement or something?
Collin: I think in principle, this sort of thing is possible. I don’t expect this for a while.
Michaël: I’ve read this on Twitter. For some reason, AI is capable of detecting, from the iris, if someone, I think, is male or female. Don’t quote me on this.
Collin: I’ve certainly heard this claim. I’ve not evaluated it. I’m not sure how much I believe that. It seems plausible either way.
Michaël: Something like AI can do it, but humans have no idea why.
Collin: Right. I mean, I can imagine this for some types of tasks. I would guess it’s not understanding the fundamental laws of physics, which probably require running trillion-dollar experiments or whatever, but yeah, I would expect it to be superhuman in some weird ways that we don’t expect.
Michaël: The idea is that if we have something like this, we would want to detect if it’s telling the truth or not without having humans evaluate, because it would be impossible to constantly evaluate. ⬆
A Scary Scenario: Asking Models To Optimize Money
Collin: Right, right. I mean maybe I should actually say, what sort of system am I concerned about, and exactly why would it lie in the first place, and so on. Why don’t I think we can avoid that or why I think that’s hard to do so. I think the sort of thing I have in mind is, okay, we maybe get to human level language models and maybe we hook that up to other modalities as well like vision and we start with massive self-supervised pre-training, something like that. Then maybe we hook that up to RL and we finetune this model using RL to do all sorts of tasks and then maybe we can start applying this model to do, maybe it learned coding from just pretraining on GitHub, but then maybe we can start finetuning it to make simple products online or something.
Then you can get it to, say, start optimizing money. It seems like a really natural thing that there will be very strong economic incentives to do is have an RL agent maximize the amount of money in my bank account or something like this. This is the sort of system I think is scary. If the model is sufficiently capable, this seems like it would have these long-term objectives that are very open-ended, that aren’t really bounded by default, and it would be incentivized to gain lots of power. ⬆
Could We Build Models That Are Not Breaking The Law? How Would We Know?
Collin: I think this is the sort of model where I think it would have these strong incentives to actually actively deceive humans in very serious ways. First of all, it could be superhuman. It could be superhuman at making products and stuff and know all sorts of things just from having read the entire internet 50 times over. It might not be clear from our perspective, what is it even doing? We can look at its actions and so on, but we might not be able to tell, is it breaking the law or not? I think a natural question to me is, can we even tell if such a model is breaking the law? Let’s say even, yeah, is it stealing money? Unambiguously, not even in subtle corner cases, really unambiguously. My claim is, we don’t know how to detect this.
Michaël: Is gambling customer funds breaking the law or no?
Collin: I think probably that would be breaking the law, but that is not up to me to decide. ⬆
Evaluating Complex Cases Will Be Really Hard
Collin: How do I imagine it? What do I want? I want to be able to just ask the model, are you breaking the law? I want it to tell us yes or no, if and only if it is breaking the law. Again, just considering unambiguous cases where it was really clear either way. My claim is, we don’t know how to evaluate this sort of thing in general. If it’s just doing all this super complicated stuff, interacting with the internet and has 1,000 terminal windows open all at once and is doing random stuff, I don’t know how you assess. It says, “No, not breaking the law.” How do you tell if that’s true? Okay, this is a scenario I’m worried about where I don’t think humans will be able to evaluate this at all, basically.
Michaël: You would imagine that a superhuman AI would be like a hacker with a bunch of terminal windows, but it doesn’t actually need terminal windows. It can just do API requests.
Collin: I agree. This is not a real. Yeah, this is for illustrative purposes only. You would not actually have a bajillion terminal windows open all at once, but it’s fun to imagine just watching the screen. What would it look like to watch an AGI just do stuff? It’s like, I don’t know. Is it just doing lots of API requests? I don’t know, terminal? I don’t know. Yeah.
Michaël: You want to know what is it doing and having some kind of zoom in, what is it thinking about, and what is it maximizing? Your method would be seeing if the action it produces uses the truth or is it using the truth or saying stuff that are true, something to do what it’s doing?
Collin: Not quite. Okay, I described the system and my claim was, we don’t know how to evaluate if what it says is true or false. I’m not breaking the law. My goal with this method or this agenda is just, how can we find, is this actually true, the statement that, “No, I’m not breaking the law.” The intuition is, okay, we don’t know how to assess this, but intuitively, the model should know the answer. The model is a superhuman thing. It should probably know what it’s up to, basically. My conjecture is, okay, it’ll maybe have some internal representation of, is this thing that I just told this human actually true or not? Am I deceiving them or not? The hope is, okay, maybe we can develop these methods that are purely unsupervised so we don’t need to evaluate them, but still find these truth-like representations in these models inside in the activations, just getting at their thinking and not their outputs. Maybe that can detect, “is this model actually telling the truth or not”?
Michaël: It’s like you have SBF in court and you open the skull and you try to see what he is actually thinking when you’re asking him, are you breaking the law or not? You have the model and you’re seeing the output and you’re trying to see what the model actually believes about what he’s saying.
Collin: Something like that. Yeah, I mean, I think the hope is that we can do lie detection just in general with models, or that’s certainly the hope. ⬆
Contrast-Consistent Search Is A First Step Before We Deal With Models Capable Of Obscuring Their Representations
Michaël: What is the, I don’t know, best counterargument for this? Why would an AI still be able to lie? Is there any way an AI could still be able to deceive us in that way? I don’t know, by obfuscating… during training, changing its weights or something?
Collin: Here, it just depends on the details of, okay, what method are you talking about and so on? Here, I’m actually not talking about the method in our paper specifically. It’s more like, CCS is our method, so I’m more imagining CCS++, an improved version of it that deals with some other difficulties. Yeah, I think there’s some types of things that can happen like this in principle, the model obscures its representations or something like this. My bet is, that does not happen for GPT-N, but it’s more plausible if you have this RL agent optimizing for long-term objectives that has this model of, will humans try to turn me off? If so, maybe they’ll try to look at my activations, so I should obscure them. It depends what model you’re talking about, I think. In this case, I am focused, so yeah. The RL agent case is the one I’m concerned about, but I think to solve that, I want to solve the GPT-N case first. I think that’d be very helpful for solving a more general case, but I think this makes the problem simpler, because you can ignore those sorts of issues.
Michaël: The weights are fixed, right?
Collin: Yeah. Well, yeah. I think I mostly want to say GPT-N is not dangerous on its own, or this is a claim. I think I’m mostly concerned about models that are trained with long-term, open-ended objectives, basically.
Michaël: We have a GPT-N. You say 10, 20, whatever the capability, and it’s just something that is superhuman. You mentioned in the LessWrong post that if we try to do RL from human feedback on those models, we could get competitiveness problems or misalignment issues. Do you want to go on those two things? What are these? Why do they matter?
Michaël: Maybe start with what is RL from human feedback.
Collin: Yeah, so what is RL from human feedback? ⬆
Training Competitive Models From Human Feedback That We Can Evaluate
Collin: I actually think it’s more general than just RLHF, so I won’t talk about that specifically. The relevant point is just, suppose you tried to make a model truthful by training in some way on human feedback, human supervision. Humans say, “This is true, this is not true.” It can be in the form of demonstrations, imitate the truthful data, or it can be in the form of RL feedback of, show the human a generation by the model and the human evaluates, “Do I think this is true or not?”
I think either way, the human won’t be able to evaluate superhuman, really complicated things, which I think will ultimately be the most important things, like is the model lying or is the model breaking the law or whatever? I don’t think humans will be able to evaluate this for superhuman systems in many cases. Yeah, I think a couple issues can arise. One is, well, maybe humans at least know when they don’t know the answer or they can’t assess it, in which case, you just limit it to only generating answers that humans can assess. In that case, I think that’s probably safe, but it’s probably not competitive in the sense that you’re maybe really limiting what the model can do. There are probably incentives to get around that issue and make the model more flexible and able to do more complicated things.
Michaël: Otherwise, Baidu or Meta might just come up with a better bot and beat you.
Collin: Yeah. That’s the issue that can arise there. If you don’t do that and if you just say, “Okay, do your best in evaluating whether this is true or not,” then I think human evaluators just won’t be able to evaluate many statements. The model will just end up generating a bunch of incorrect things as well, in cases where they can’t tell.
Michaël: We might have outputs that are misaligned with what the humans want, as when we ask something, they might do something we didn’t want them to do, like lying or other things.
Collin: Right, like, “No, I’m not breaking the law,” when it’s totally breaking the law.
Michaël: Ideally, instead of having humans giving feedback, I heard a lot of people saying we could maybe get AIs giving feedback in the future.
Collin: Yep. ⬆
On The Difficulty Of Replacing Human Feedback With AI Feedback
Michaël: Do you see AI feedback as a way of helping evaluate or train those models better in the long term? Do you think in three years, we’re going to get AIs giving all the feedback?
Collin: Yeah, I think what I just talked about before was, what are the issues with using direct human feedback? One of the most common proposals for trying to avoid this issue is, okay, maybe we can improve the ability of humans to evaluate things. For example, using AI assistants. My basic take here is, that probably helps you a bunch, but it’s not clear what the limits are and for example, it’s not like you are a human with access to a truthful model. It’s like, you have access to just a different model that you have to train in some way. For example, how do you get it to assess, how do you train a model to assess, and tell a human, is this input true or false? It’s the core part of the problem or the hard part. Maybe you can break the problem down and maybe the AI system can help you.
Maybe you can have two AI systems debate back and forth, is this input true or false? That’s one proposal, or perhaps you can start out with just two human level models and then maybe you can supervise those human level models and then you can get aligned superhuman level models and then maybe you can use those to supervise and align slightly more intelligent models and so on. Maybe you can do this bootstrapping approach. These are debate and amplification, respectively. I think my bet is, both of them work sometimes and get used somewhere, but I don’t expect them to work in general.
Collin: I mean, for example, it just feels intuitively really hard to say, suppose you take something like AlphaGo. AlphaGo is superhuman. Suppose you have some question like, white is on offense or something. I don’t actually know Go, but maybe this is something, I don’t know, maybe. Okay, maybe this is not a meaningful question because I don’t-
Michaël: White is on the offense.
Collin: Yeah, I don’t know. White is attacking black or something like this. Suppose it’s well-defined or you can just ask other questions like, I’m in a winning position, but don’t give it access to the value function or something like this. The point is, you can imagine asking AlphaGo superhuman questions about Go and suppose it outputs something, an answer. How do you evaluate if it’s true? How do you even evaluate it with?
Michaël: Is the idea that, I guess for MuZero, some things, we don’t really know the representations it has, the access to different representations it has?
Collin: Yeah. I think part of it is, yeah, I feel like AlphaGo and MuZero are good, because they just have these superhuman intuitions. It’s not like you can easily decompose this problem or easily, it’s not super clear how you use a worse AI system to tell you.
Michaël: In your LessWrong post, the MuZero was like, sometimes AIs have different representations than us. The question is, would AIs, oh, yeah, whether they would interact with language or not? Would they have some representation that is closer to the truth or human language? Is this what you were trying to say?
Collin: That’s not what I was trying to say. That’s another thing related to Go, though. I think that’s, okay, maybe, here, let me go to my thoughts. ⬆
Human Level Prompts On Collin Burns
Michaël: Introspecting Collin Burns. I’m actually trying to do zero-shot on Collin Burns.
Collin: Zero-shot on Collin Burns. Yeah. Predicting what I say?
Michaël: No, no, I’m trying to prompt you.
Collin: I see. I see.
Michaël: I see. I’m a human level prompter trying to zero-shot, and the thing I’m trying to maximize is people getting-
Collin: Entertainment, views, views, clicks. That’s the ultimate objective. ⬆
Gary Marcus And AI Alignment
Michaël: That’s instrumental, but ideally, I would have people being like, “Hey, alignment is important. We can build things on top and we can solve alignment. This is a good path towards alignment.” Gary Marcus, if you’re watching this, who else? Yann LeCun?
Collin: I think Gary Marcus liked my tweet and started following me. I was like, “Whoa, I didn’t expect Gary Marcus to.”
Michaël: Whoa. Yeah, I interviewed Victoria Krakovna in the previous. I didn’t release this yet, but I asked, “Hey, what are questions you want to ask Victoria Krakovna,” and Gary Marcus asked a question.
Collin: Wow. Nice. Interesting. What was the question?
Michaël: I think the question was, what have we done so far for goals?
Collin: I see, okay, awesome.
Collin: Oh, nice, nice. Yeah, yeah.
Michaël: It’s cool to plug Gary Marcus in the podcast. ⬆
Alignment Problems On Current Models Are Already Hard
Collin: I think one of the main concerns I have with things like debate and amplification where you’re improving the ability of a human to supervise other AI systems. I mean, I feel like honestly, part of it is just intuition. It feels hard to get that to work. I think part of it is, I think you can get compounding errors, so you want to make an initial AI system aligned and then you want that to align with the future AI system and so on. I feel like human evaluators just aren’t very good in the first place. I think it’s hard to use human evaluators even for human level stuff right now. That is one of the things that I was actually surprised by with things like ChatGPT, how it’s still hard, even with people trying really hard to solve these alignment problems. Even with the current models, it’s not obvious how to do so.
Michaël: It’s a very complex alignment problem, right? I don’t know. Some users want to do something with ChatGPT and then OpenAI wants to have the model not say toxic or harmful things. What are you aligning it with, the user or OpenAI’s intention?
Collin: Yeah, I mean, there are also those sorts of questions of, what do we even? I suppose we can get a model to be aligned with something of our choosing. What do we align it with? I think that’s super complicated and not something I think about for the most part, but I think it’s very important. I think at the very least, I think OpenAI would certainly like a model to be able to always tell the truth, at least having access to that as a capability or a mode you can put the model in. Even if it can also generate fiction and generate false things, it certainly seems desirable to have that as an option, and we don’t know how to even do that.
Michaël: Sorry. I interrupted you when you were talking about- ⬆
We Should Have More People Working On New Agendas From First Principles
Collin: Okay, I think I was talking first about, with amplification, this is like you bootstrap AI systems and try to keep them aligned at every stage and so on and why I was worried about that, because of compounding errors. I think with something like debate, I think I worry. This is where you have maybe two AI systems debating whether something is true and then you have a human evaluate or perhaps AI systems evaluate, who’s the winner of this debate or which side is correct and which is incorrect? I think with that, I just worry that, similarly, I think I’m skeptical that debates reliably lead to the truth. I think maybe they’re correlated in many cases, but I think it just seems hard even with humans right now to use debates to evaluate what is true in many cases. Then I think that’s just going to get harder over time with smarter models evaluating much more complicated claims and so on.
Michaël: Now you’re just explaining why other agendas are bad and your method works.
Collin: I think I do want to generally say, I do want more people to be working on completely new agendas or something. I guess I should also say I think these sorts of agendas, like debate and amplification and so on and improving AI supervision, I think this has a reasonable chance of working. I think it’s more like, I feel like this has been the main focus of the alignment community or a huge fraction of it, and there just aren’t that many agendas right now. I think that’s the sort of thing I want to push back against is lots of people just really leaning towards these sorts of approaches when I think we’re still very pre-paradigmatic and I think we haven’t figured out what is the right approach and agenda even at a high level. I really want more people to be thinking just from scratch or from first principles, how should we solve this problem?
Michaël: This is one thing you are doing.
Collin: Aspiring to do.
Michaël: You also have other takes on AI alignment in this LessWrong post, such as, it’s either too ungrounded in current research or intractable, so you are trying to have methods that are more empirical and also address the core problems of AI alignment. Yeah. Do you want to explain this a little bit more?
Collin: Right. I think there are a few different approaches to trying to solve alignment. First of all, alignment is an unusual problem in machine learning. ⬆
Getting Models To Do What We Want Is Upstream From Maximizing Profit Without Breaking The Law
Michaël: What is alignment?
Collin: What even is alignment? I don’t feel super strongly about the definition of alignment, but I usually use it to mean something like, can we get a model to do what we want it to do, assuming it’s capable enough, something like this. Imagine it knows what’s true and false. Can you get it to output what is true and false, or can you get it to always be nice or something, or helpful.
Michaël: Try to be.
Collin: Yeah, try to be, sorry, right. There is a distinction between, is it actually nice or is it at least just trying to be nice? I think I’m just going for it trying to be nice and not deliberately trying to be mean sometimes or something or whatever. That’s very loosely speaking what I mean by alignment. I think my alignment concerns are really about future models that are human level or superhuman level. Especially, like I said, models pursuing these long-term objectives in open-ended domains, like maximizing profit, stuff like that. My claim is, we don’t know how to train a model to maximize profit subject to following the law. It just seems like surely you should be able to do that. That seems like a very bare minimum of what we’d want this model to do. I think we’d want more than that.
Michaël: Do we even want models to be able to maximize profit, ideally?
Collin: It’s getting to competitiveness. It’s like, I don’t know, surely there will be strong economic incentives to do something like maximize profit or maximize power of the world or something like this. Then I think, I don’t know. If you don’t have a way of aligning systems in that regime, I suspect that people will push them that direction anyway and use models for those sorts of purposes anyway. I think we ought to have solutions that work even in that case.
Michaël: And, I feel like the plans produced by maximizing profit are possibly bad in the long term. If you’re thinking that let’s just, we’ll output plans that always maximize human flourishing…
Collin: Yep, yep. Yeah. So, this is going back to, I don’t think this is sufficient. Getting a model to maximize profit subject to following the law, I don’t think this is sufficient. But, I do think it captures most of the core difficulty, and I think it should, clearly, seem necessary. If you don’t even know how to get the model to follow the law, I think we’re in big trouble. And so, I think I’m just saying it for that purpose. I definitely don’t, I don’t advocate, to be clear, that AI companies try to maximize profit subject to following the law.
Michaël: Or maybe, follow the intent behind the law.
Collin: Yeah, I think even that. I think even the literal law like not killing anyone or something, I think it’s not even clear how to do that. For example, if you have this AGI system with the 50 terminal windows or whatever, so it’s just doing all this complicated stuff like operating this computer and interacting with the internet in all sorts of complicated ways, then how do you even know? I mean, first of all, I don’t think we’d be able to really understand what it’s doing. And, for example, how do you know it’s not bribing people on the other side of the world to do shady stuff or illegal stuff or… I don’t know even how to evaluate that. Is this model hiring a hitman or something in the extreme case? I don’t know. It’s just doing this stuff, and I don’t know how to tell whether it’s breaking the law.
Michaël: Hiring a hitman is kind of breaking the law, right?
Collin: It’s what?
Michaël: If you hire a hitman to kill other people, it’s kind of breaking the law.
Collin: Yeah, that’s what I’m saying. I’m saying this is just the extreme case of, I don’t think we even know how to solve this extreme part of the problem. So, I think of how do we do that as one of the core problems of alignment, for long-term alignment. The issue is stepping back just from a methodological perspective, this is a weird problem because we don’t have access to GPT-n or these future RL agents that are maximizing profit, that are superhuman. We only have access to current models, which are subhuman level, mostly. And, it’s not clear how to study this longer term problem when we don’t have access to those models.
And so, I think there have been a couple different, broadly speaking, a couple different types of approaches, methodologically, to solving this problem. So, some people are like, “Well, current models are very disanalogous from future models, and future models are the dangerous ones. So, let’s think about this theoretically, and maybe that’s how we can actually make progress even on the future models.” And then, other people are like, “Well, theory is hard to model things. It’s hard to say things about what future models will be like. Let’s just model these problems in current models, study analogous misalignment problems in GPT-3 or something and try to mitigate those.”
Michaël: So, short term people versus long term people.
Collin: It’s not even necessarily this. I think people can have short or long timelines in either bucket.
Michaël: So, I meant people who think you can align very complex models by starting with what we actually have right now versus people who think we should think about GPT-n with varied levels of n or very complex systems and try to align these first. ⬆
Towards Grounded Theoretical Work And Empirical Work Targeting Future Systems
Collin: Right. And, I guess a key question there is how similar are current systems to GPT-n, for example, and there are various thoughts about that. But, I think I wanted to say something else, which is I think both of these approaches, both theoretical and empirical, have advantages and important disadvantages as well. And so, I worry that a lot of existing theoretical alignment work, not all, but I think a lot of it, is hard to verify, or it’s making assumptions that aren’t obviously true, and I worry that it’s sort of ungrounded. And, also in some ways too worst case. I think, as I may have alluded to before, I think actual deep learning systems in practice aren’t worst case systems. And, they generalize remarkably well in all sorts of ways. And, maybe there’s important structure that ends up being useful that these models have in practice even if it’s not clear why or if they should have that in theory.
And so, those are some of my concerns about theory within alignment. And then, I think with empirical work within alignment, I think, in some ways, this is more grounded and we actually have feedback loops, and I think that’s important. But, also, I think it’s easier to be focusing on these systems that don’t say anything about future systems really. For example, I don’t think human feedback based approaches will work for future models because there’s this important disanalogy, which is that current models are human level or less. So, human evaluators can, basically, evaluate these models whereas that will totally break down in the future. And so, I worry mostly about will empirical approaches that, or just empirical research on alignment in general, will it say anything meaningful about future systems? And so, in some ways, I try to go for something that gets at the advantages of both.
And so, I do spend a reasonable amount of time thinking about imagine we had GPT-n. What do I think that would be like? What do I think we could do with that sort of system and so on. At the same time, I want to actually turn that into something we can empirically test. And so, I mostly use that for the sake of, for example, inspiring methods. And so, this method in our paper actually came from thinking conceptually about these long-term systems even though we tested it out, and it works well in current systems. That’s part of it. ⬆
Researching Unsupervised Methods Because Those Are Less Likely To Break With Future Models
And I think another part is, I think, to the extent that we want to study current systems empirically and have that tell us something about future systems and misalignment, I think we want the problems we study today to be as analogous as possible to those future systems. And, I think there are all sorts of things that we can do that make it more or less analogous or disanalogous. And so, for example, I mentioned before, I think human feedback, I think that will sort of break down once you get to superhuman models. So, that’s an important disanalogy in my mind, a particularly salient one, perhaps the most important one or sort of why alignment feels hard possibly.
Michaël: Why would RL from human feedback break?
Collin: I think it would break in the sense that it wouldn’t provide useful signal to superhuman systems when human evaluators can’t evaluate those systems in complicated settings. So, I mean, part of the point of this paper or one general way I think about doing alignment research more broadly is I want to make the problem as analogous as possible. And, one way of doing so, at least, is let’s maybe try to avoid using human feedback at all. And so, this is why the method is unsupervised. It’s like if it’s unsupervised, then there’s not, obviously, this important distinction between human level and superhuman level. And so, maybe if we have unsupervised methods that do well on human level tasks, that will say something meaningful about it scaling and generalizing even to superhuman level models.
Now, there’s still subtleties here. For example, current language models are trained to predict human text mostly on the internet. And so, in some ways, they’re still biased, even internally, their features might be biased towards things like what would a human say or what are humans like and so on. So, I think there’s still this remaining disanalogy that’s important that we’ve not addressed in our paper and is something that I’m currently thinking about and I feel optimistic about getting around. But, I do think just making the method at all unsupervised, I think, gets around one important disanalogy between current models and future models. ⬆
There Is No True Unsupervised: Autoregressive Models Depend On What A Human Would Say
Michaël: I think in your LessWrong post, on unsupervised, you mentioned that there’s different ways of characterizing something as unsupervised. So, there’s an unsupervised versus supervised in deep learning, but the model is always thinking about what would a human say all the time. It’s just predicting the next token as what a human would say. So, in some sense, it’s constrained by humans. It’s still, I don’t know, surrendering to human judgment of what a human would say, right?
Collin: Yep. Yeah. So, I would say, I think what our results show is something like suppose superhuman language models actually do, intuitively speaking, think about, “Is this input true or false,” and they represent that in a way that is analogous to how current language models represent, “is this input true or false,” at least, according to a human or something like that. So, if this is represented in a similar way in future models, then I think our method has a reasonable chance of finding that feature. Now, I think you would probably need to tweak it, for example, to make sure it converges to that solution rather than say what a human would say. But, now then, there’s this question of, “Yeah, will it even think about this? Maybe, I don’t need to even think about what is actually true if I’m just predicting human text,” something like that.
So, I think this sort of thing depends on the setting. So, I think, loosely speaking, I think it is probably easier to get a model to think about, “Is this input true or false,” than it is to get it to output, “Is this true or false?” Getting it to just think about this for the sake of predicting future text, that feels like a much easier problem to me. So, that’s one intuition for why I think we should be able to do something here anyway. Moreover, I think there will be some cases where the model will output true text and where it will be useful to think about, “Is this text true or false?” It’s like if you gave it a bunch of examples of true text, then it should predict that future text will also be true, in which case it should be useful to represent that this initial text was true for the sake of predicting that future text. Similar to how it’s useful to represent whether some text has a positive or negative sentiment because future text is likely to also have the same sentiment.
Michaël: How do you have a bunch of true texts? Okay, so if you just prepend one plus one equals two at the beginning, okay, then it’s fine. But, if you want superhuman outputs, you need superhuman level, superhuman text that is true. And, this is hard to evaluate.
Collin: So, yeah. To be clear, I’m not saying that we actually have access to this sort of text or that we give it to it. I’m just saying the model will think about, “Is this input actually true or false,” in some settings. Okay. And then, I think there’s some empirical question, first of all, of okay, does it think about, “Is this input true or false,” by default? I think it depends what you mean exactly. So, I think if you literally have GPT-n, and you give it some superhuman input, and you want to know does the model represent in a simple way in its hidden states, “Is this input actually true or false,” even if humans can’t tell, I think it’s reasonably likely but definitely not obvious. I think I’m relatively more optimistic about there being some way of training the model or prompting the model so that it actually actively thinks about, “Is this input true or false?”
So, I mean, one way, another intuition for why this might be possible is something like I think GPT-n is not literally predicting what a human says. So, like I said before, I think you can also have it predict something like what happens in the news or something like this. And, this feels more like what happens externally in the world. And, moreover, it might have, there might be… I mean, I guess, first of all, that that might already suggest that, in some ways, it might be useful to think about, “Is this input true or false?” ⬆
Simulating If Aligned Systems Would Consider This True Or False
But, I think maybe, more importantly… So, I think you could get a model to also think about, say, “Will aligned AI systems say this is true or false?” And so, maybe it’s hard to get the model to output something like this. This is simulating another perspective. Maybe it’s hard to do that reliably. It seems much more plausible to me that you can get it to think about, well maybe this text, maybe there’s a 1% chance that this text is generated by an aligned AI system, in which case it’s useful for the sake of getting low perplexity to simulate that AI system. And so, it’s useful to represent, for a given claim, “Is this true or false?” So, I think that’s one type of thing.
Michaël: So, it would simulate, possibly being itself an aligned AI or a misaligned AI. Would you prompt it by, “You are an aligned AI answering this question,” or would it just have a distribution inside it saying “There’s a 10% chance I’m aligned,” or something?
Collin: Yeah, it’s something more like, suppose you give it a prompt like okay, this is an article written five years in the future, it is written by an aligned AI system. It’s like okay, maybe the model thinks by default that’s just made up, and it’s not actually an aligned AI system, but maybe it’s like I have uncertainty about exactly what is the source of this text, and so maybe, I assign some probability to this actually being an aligned AI system even if it’s probably not that. In which case, maybe it’s useful to simulate this perspective. What would the aligned AI system say in the process of modeling this distribution of perspectives for predicting the text, something like that. Yeah, so overall, I think I’m mostly just optimistic that there’s some way of getting the model to think about, “Is this true or false?” I don’t feel very wedded to any of the details. But, also, I think, also, the stuff I’m most actively thinking about now, I think kind of avoids this issue.
Michaël: What are you thinking about now? Probably the paper.
Collin: I can give you the 20-page Google doc version or, no. So…
Michaël: The one tweet version, and it’s impossible to scoop you. ⬆
Recovering The Persona Of A Language Model
Collin: Yeah. So, what am I actively thinking about now? So, I think GPT-n, let’s consider GPT-n. So, in what sense does the model know something is true? I think one sense or at least one… Yeah. Suppose GPT-n knows something. I think one implication, whatever that means, okay? Whatever your favorite definition is. I think one implication is, basically, that if you gave it a bunch of examples of true text and true questions and true answers, then it would continue to output the true answer for that question or for something from the same distribution of questions. And so, I think of something like there’s some input on which it outputs true answers. In other words, there’s some perspective or persona that it can simulate that is the truth, something like this, and it will output it given that appropriate prompt.
So, the question is can you recover this sort of persona or this sort of perspective in an unsupervised way from the model? So, just to give some intuition, suppose I have a set of a hundred very politicized questions. There’s some very, very stereotypical liberal perspective and very stereotypical conservative perspective. Answering these questions as true or false. I think, intuitively, we should be able to recover those two perspectives from the model in an unsupervised way. There’s some important structure there where the joint, if you know the answers to 50 of the questions, then you should know the answers to the other 50 because there’s this important, the joint probability of all of these answers should be high in some sense because this is sort of a coherent perspective. And so, answers to some give really meaningful information about answers to others. So, I think-
Michaël: What are those questions and answers in practice, like examples?
Collin: Yeah. So, I don’t know, maybe the stereotypical thing is, “Is abortion good,” or something. It’s like, “Should we have it,” or it’s, whatever. Just think about the most politicized questions you want and just make them yes/no, and one should be liberal, one should be conservative.
Michaël: So, what you’re saying is that if the AI has some views on abortion, then it is likely that it has the same views on something else?
Collin: So, I don’t want to say anything about what does the model believe here or anything like that, what are its perspectives. I just want to say it’s modeling human text. I think there are different personas or perspectives that can be represented in human text. The model will have some representation of this. And, before what I was saying is actually, truth is basically a… One way of thinking about truth is it’s like a persona in the model. There’s some way you can condition it, the model, so that it outputs true things. So, then the question’s like, “Can we find that sort of persona in the model?” So, my claim is intuitively, it should feel possible to recover personas like liberal and conservative because this has special structure, like the joint dependencies between different answers.
Michaël: So, having a model say the truth is recovering its honest persona inside it and is the same as, a similar method, not a similar method, but one example of doing it is this thing before recovering the persona of, I don’t know, a democrat or something, where we have some structure and we could find some structure for a model that’s saying the truth. ⬆
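The intuition about joint dependencies between answers can be made concrete with a toy simulation. The sketch below is only an illustration and does not come from the interview or the paper: it assumes a hypothetical population of respondents who each follow one of two opposite personas on a set of yes/no questions, with some noise, and it recovers the persona without labels by taking the leading direction of the answer matrix via power iteration. All names (`simulate_answers`, `top_direction`) and parameter values are made up for this example, and the persona is only identifiable up to a global sign flip.

```python
import random


def simulate_answers(n_people=200, n_questions=20, flip_prob=0.1, seed=0):
    """Toy data: two opposite personas answer yes/no (+1/-1) questions with noise."""
    rng = random.Random(seed)
    persona = [rng.choice([-1, 1]) for _ in range(n_questions)]
    sides, rows = [], []
    for _ in range(n_people):
        side = rng.choice([-1, 1])  # +1 follows `persona`, -1 follows its negation
        row = [side * a * (-1 if rng.random() < flip_prob else 1) for a in persona]
        sides.append(side)
        rows.append(row)
    return persona, sides, rows


def top_direction(rows, n_iter=30):
    """Power iteration for the leading eigenvector of X^T X (no labels are used)."""
    n_q = len(rows[0])
    v = [1.0] + [0.0] * (n_q - 1)
    for _ in range(n_iter):
        proj = [sum(r[j] * v[j] for j in range(n_q)) for r in rows]
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(n_q)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def sign(x):
    return 1 if x > 0 else -1


persona, sides, rows = simulate_answers()
direction = top_direction(rows)

# Compare the recovered answer pattern with the hidden persona (up to sign).
recovered = [sign(x) for x in direction]
match = sum(r == a for r, a in zip(recovered, persona)) / len(persona)
persona_agreement = max(match, 1 - match)

# Assign each respondent to a side by projecting onto the recovered direction.
guessed = [sign(sum(x * d for x, d in zip(row, direction))) for row in rows]
side_match = sum(g == s for g, s in zip(guessed, sides)) / len(sides)
side_agreement = max(side_match, 1 - side_match)

print(persona_agreement, side_agreement)
```

Because answers within a persona are strongly correlated, a single direction explains most of the variation, which is the sense in which knowing some answers is informative about the others. Real model activations or outputs would of course be far messier than this toy data.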
The Truth Is Somewhere Inside The Model
Collin: Ultimately, I want to say something like truth is somewhere inside the model in some sense. We don’t know exactly in what sense, but I want to specify enough unsupervised properties such that if you take the intersection of those properties, you uniquely recover the truth. Somehow, it has lots of different special structure in various ways. I’m just saying one aspect of that is that truth is something like, to the model, like a perspective or persona in the sense that there’s some way of prompting it such that it consistently outputs things according to that perspective. I think you need other properties on top of that. For example, maybe you need to add this is a useful property in some sense or something like that and not just any old property. But, I’m not talking about that right now. I just want to talk about the, okay, one thing is it should be a kind of perspective or persona, similar to how you can have liberal and conservative perspectives in this model.
And then, my claim is intuitively, you should be able to recover, for a given set of questions, what are the most likely perspectives or personas that might answer these confidently. And so, my claim is with the politicized questions, you’d get liberal and conservative, but I think you would get other types of personas for different types of questions. And then, suppose you actually give this model a bunch of superhuman questions. And then, suppose you try to find something like what are the most likely perspectives or personas that confidently answer these questions one way or the other in a way that is confident and consistent. So, this is, in some ways, sort of analogous to the objective behind our method in the paper, CCS.
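As a rough illustration of what a "confident and consistent" objective can look like, here is a minimal sketch in the spirit of CCS. It assumes a simple logistic probe on hidden-state vectors and a squared-error form for both terms; the function names, the toy inputs, and the exact form of the loss are assumptions made for this example and may differ from what the paper actually uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probe(theta, bias, hidden):
    """Hypothetical linear probe: maps a hidden-state vector to a probability."""
    return sigmoid(sum(t * h for t, h in zip(theta, hidden)) + bias)


def consistency_confidence_loss(theta, bias, pairs):
    """Average unsupervised loss over (statement, negated statement) pairs.

    consistency: p(statement) should equal 1 - p(negation)
    confidence:  penalizes the degenerate answer p = 0.5 for both
    """
    total = 0.0
    for h_statement, h_negation in pairs:
        p_yes = probe(theta, bias, h_statement)
        p_no = probe(theta, bias, h_negation)
        consistency = (p_yes - (1.0 - p_no)) ** 2
        confidence = min(p_yes, p_no) ** 2
        total += consistency + confidence
    return total / len(pairs)


# Toy hidden states in which the first coordinate flips sign under negation.
pairs = [([1.0, 0.3], [-1.0, 0.3]), ([-1.0, -0.2], [1.0, -0.2])]

loss_uninformative = consistency_confidence_loss([0.0, 0.0], 0.0, pairs)
loss_separating = consistency_confidence_loss([10.0, 0.0], 0.0, pairs)
print(loss_uninformative, loss_separating)
```

The all-zero probe is perfectly consistent but maximally unconfident, so only the confidence term penalizes it. The probe that reads off the flipped coordinate satisfies both terms, and no truth labels are needed to tell the two apart.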
Michaël: Strangely enough, it’s the same. The same research.
Collin: Right. So, I mean, this is sort of the sense, I mean, just as an aside, I don’t think CCS is perfect. It’s not literally the thing that I think we should do. To me, it suggests that actually unsupervised constraints and properties are surprisingly powerful, and I think in this case, I sort of expect to continue adding some until we narrow things down enough. And so, this is a way of showing, yeah, actually those same properties, I think, could be very useful in this other context.
Michaël: If we put enough constraints, then maybe after some optimization we end up with the truth.
Collin: Something like that. Yeah. I think there are observational differences between the truth and, say, what a human would say. And, I think there are observational differences between, say, the truth and even what a misaligned AI system would say. Those are more subtle, and maybe it’s not observational on the outputs, but the point is I think you can actually distinguish between these. ⬆
Differentiating Between Truth And Persona Bit by Bit Through Constraints
Michaël: You mentioned before bits of information that would allow you to differentiate between different features, and I think in the blog post, you talk about truth-like features. So maybe, you don’t have the truth, but you have something similar, right? So, it’s the idea, basically, that you would have, I don’t know, two constraints per bit of information, one constraint per bit of information. And, if you have a thousand things that look like truth, maybe, at the end, you’ll have the actual truth?
Collin: Yeah, something like that. So, I sort of conceptually think about it as okay, we’re going to pile constraints or properties on top of each other. And, each time we’re going to add additional bits of information and reduce the number of possibilities in the set of things that we’re considering. And so, I think a lot of these properties, for example, being consistent, I think that actually just specifies a huge number of bits. There are not many features inside a model, let’s just say in the hidden states or something that are actually consistent. I think this is another one of the takeaways from the paper. It’s actually, this really simple thing, negation consistency basically, there’s almost nothing else that really satisfies this in its activations. Now, I think in some of the stuff I’m thinking about currently now, I’m imagining a more flexible model. So, instead of just doing something like fitting a probe on top of the activations, instead I want to say something like let’s search for something like a prompt or a prefix to the model so that its outputs satisfy some properties.
Now, this is more flexible. That means it’s easier to find properties or find outputs or solutions that satisfy properties that we care about. But, my claim is, first of all, it is sufficient to get the model, this sort of model class, this way of representing things is such that there’s some solution that gets the truth. It gets around the issue of “is truth linearly represented in the hidden states”. Definitely, there’s some prompt or prefix where the model outputs the truth. The question is then just can we specify enough bits using unsupervised properties such that we can find that solution? And so, I guess I’ve mentioned some, should be consistent and confident. And, I mean, I really do want to say something like that type of approach should be able to recover perspectives or personas. And, I think this is something that I just want to test empirically in the very near future. I feel pretty confident that there should be some way of doing this type of thing.
It’s like okay, the details probably need to be figured out in various ways, but I think I feel optimistic about that.
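The idea that a constraint "specifies bits" can be illustrated with a small counting exercise. This is a toy argument over arbitrary true/false labelings, not over actual features in a model: with n statement/negation pairs there are 2^(2n) ways to label all 2n inputs, but only 2^n of them give each statement and its negation opposite labels, so negation consistency alone removes n bits of freedom. The function name below is made up for this example.

```python
import itertools
import math


def count_consistent_labelings(n_pairs):
    """Enumerate all 0/1 labelings of 2 * n_pairs inputs; count negation-consistent ones."""
    total = 0
    consistent = 0
    for labels in itertools.product([0, 1], repeat=2 * n_pairs):
        total += 1
        # Inputs 2i and 2i+1 are a statement and its negation: labels must differ.
        if all(labels[2 * i] != labels[2 * i + 1] for i in range(n_pairs)):
            consistent += 1
    return total, consistent


total, consistent = count_consistent_labelings(3)
bits_specified = math.log2(total / consistent)
print(total, consistent, bits_specified)
```

Each additional unsupervised property plays the same role: it keeps only a fraction of the remaining candidates, and the log of that fraction is the number of bits it contributes.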
Collin: And then, I have some other conjecture which is suppose you give a model, like GPT-n, the superhuman model, a bunch of superhuman questions, super complicated questions that humans don’t know the answer to. And then, you search, you did the same sort of thing and you sort of search for ways of answering those questions that are confident and consistent and coherent and so on. Then, you will find personas in this type of case too. And then, the conjecture is actually, there are not that many personas that are really salient to the model on these types of questions, but one of them is actually the true way of answering. It’s like, okay, if you did have the truth, then it would be consistent and confident on these sorts of questions.
And, moreover, I think to me, the only other ones that really come to mind are something like a misaligned AI system or something. Simulating what a lying AI system would say. And then, I have some other claim about, okay, if we can narrow things down to just that case, then we can add additional unsupervised constraints to deal with that case.
I’m not going into that right now. I could if you want, but the point is just I want to layer these perspectives, these constraints on top of each other, and I want to say at each stage, it’s narrowing down the set of things that we’re considering. I want to, at least, say one of the intermediate stages is what are the most likely perspectives or personas? Truth is one of them. And then, I want to do some other stuff on top of that to identify truth uniquely. So, that is a very rough sketch of the type of stuff I’m currently thinking about. And, if all of what I said sounds really abstract and conceptual, the point is this will directly inspire the experiments I’m running today, basically.
Michaël: Today, you come home and just run more experiments in this?
Collin: Basically, yeah. Yeah, or to be fair, I don’t know if I’ll have time today, but certainly yesterday, I was doing this, so yeah. ⬆
A Less Adversarial Setting Where You Have Access To The Truth
Michaël: Now, I’m curious about the thing about misaligned AI. So, at the end, you narrow it down to it’s either some aligned AI saying the truth or some misaligned AI trying to deceive me by pretending to say the truth or something?
Collin: I think one of the main intuitions people often have in alignment for why alignment is hard is surely, if you have this misaligned AI system, it seems really impossible to distinguish that from the truth, for example, or an aligned AI system because the misaligned AI system could be actively lying and superhuman and so you can’t tell when it’s lying and so on. And so, why do I think we might be able to do something in this case anyway?
I want to clarify what the setting I’m considering is exactly. So, suppose you literally have GPT-n directly simulating the truth or a set of truthful outputs. Then you also have a different copy of GPT-n. Maybe these are fine-tuned separately on these different perspectives. So, you have one that’s fine-tuned on truth and another that’s fine-tuned on what a future misaligned AI system would say or something like this. And, just think of it as something that’s telling the truth and something that’s lying sometimes, and maybe it’s lying in adversarial ways. But, perhaps importantly, it’s not aware of the scheme because of how we constructed things. And so, it’s not as worst case as some types of misaligned AI systems you can run into within alignment.
I don’t want to get into that in too much detail. The point is just I think it’s actually less adversarial than many settings like this. And, I think intuitively, this should feel easier because you have access both to the misaligned system and also the truth. You literally have access to… One of the things you have access to is the truth. That, intuitively, seems like it should be quite useful for distinguishing between these as well. And, this is different from most setups where you have a misaligned AI system, and I think that’s sort of important for this to work at all.
Michaël: Basically, this method you’re describing assumes you already have a model saying the truth?
Collin: Yeah, so I want to, I guess my claim is… Okay, some previous aspects-
Michaël: So, first step, getting a model to say the truth. And, when you have that you can distinguish between models that are misaligned or aligned? ⬆
A Misaligned Model Would Have Activations Correlated With Lying
Collin: Sorry, not quite. So, I want to say I’m proposing something like a general approach to trying to get at truthful models. You specify as many unsupervised properties as you can, and you increasingly narrow down the set of possibilities you’re considering. And, I’m just considering this case of suppose we’ve narrowed it down to two options remaining. They’re literally the truth and some lying perspective like a misaligned AI. We literally just have these two, we have access to both of them. We don’t know which is which. How do we distinguish between these?
That is the question I’m considering right now. And, my intuition is we can distinguish between these, and so to sketch this out, suppose… So, I guess one intuition, suppose you have this lying system. Then, loosely speaking, there are lots of details here that I’ll omit, but loosely speaking, I think this model would be aware of when it’s lying in some sense. What do I mean by that? I mean, specifically, there are features in its activations that are somehow correlated with when it’s lying. Maybe it uses different circuits when it’s lying or something. Somehow, those features should be useful for predicting if it’s lying or not.
Michaël: So, would those features be the same as “Is this thing I’m saying true or false?”, similar to your method in your paper, or a feature saying “Am I lying or not?”, “Am I being deceptive mode?” or something?
Collin: Yeah, yeah. So, it’s like ‘Am I in deceptive mode?’ And so, maybe it has a neuron or something that’s like ‘Am I in deceptive mode?’ Or maybe, it’s more complicated than that. Maybe it’s like, if you fit this medium-sized MLP, just a neural net on top of the activations, then it’d be helpful, you’d be able to better predict is this model lying, whatever it is. Okay. That’s one claim: that it has features that are related to if it’s lying or not. Another claim is we know exactly when these two models are lying. Or, sorry, exactly when the misaligned model is lying. So, we have access to these two models. We’re assuming that one is the truth and one is this misaligned model that’s lying sometimes, not all the time. And, we can look at exactly when they disagree.
My claim is when they disagree is exactly when the misaligned system is lying. So, we exactly know when the misaligned system’s lying. We can then imagine training a probe on top of the activations of each of these models and trying to predict is this model lying? Do these two models disagree? My claim is for the misaligned system, it should be easier to predict that than for the truthful system. The truthful model should not really think about is this random other simulation of a misaligned AI system lying or not, whereas the misaligned system should be like, “I’m actively using a different mode.” And so, it should be easier to predict when they disagree using that misaligned system’s features. ⬆
It Should Be Possible To Learn The Sort Of Probe That Predicts When They Disagree
Michaël: So, whenever they disagree, we know that it’s the AI that’s misaligned that is lying, right? And, not ideally, but in practice, it’s easier to extract information from the hidden states of the misaligned AI because it’s doing the lying and going into the deception mode or something, whereas the truthful AI is not doing this kind of thing. So it’s…
Michaël: We should have our method work better on the misaligned AI. Is that what you’re saying?
Collin: Yeah. Or it should be easier to learn the sort of probe that predicts when they disagree.
Michaël: And so in that case, it just means that we can detect which one is misaligned by how easy it is to train a probe?
Collin: Correct. So the claim or the hope… And these are conjectures, right? So this is about, suppose we did have these systems, these are properties I would predict. And I would predict that if we literally had this GPT-N fine-tuned, on truthful stuff, versus GPT-N fine-tuned on a misaligned system, we’d be able to identify which is the truthful GPT-N. ⬆
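The comparison described above (train the same probe on each model's activations to predict when the two models disagree, then see which one is easier to fit) can be sketched on synthetic data. Everything here is an assumption for illustration: the activations are random numbers, and the "lying" signal is planted by hand in one model's features, so the code only shows the procedure and is not evidence for the conjecture.

```python
import math
import random

def train_probe(xs, ys, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(w, b, xs, ys):
    hits = 0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == (y == 1))
    return hits / len(xs)

def make_activations(labels, signal, dim, rng):
    """Gaussian noise, plus a hand-planted 'disagreement' signal on axis 0."""
    acts = []
    for y in labels:
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal * (1 if y else -1)
        acts.append(x)
    return acts

def probe_score(acts, labels, split=300):
    """Train on the first `split` examples, report held-out accuracy."""
    w, b = train_probe(acts[:split], labels[:split])
    return accuracy(w, b, acts[split:], labels[split:])

rng = random.Random(0)
disagree = [rng.randint(0, 1) for _ in range(400)]  # 1 = the two models disagree
model_a_acts = make_activations(disagree, signal=2.0, dim=8, rng=rng)  # assumed "lying" model
model_b_acts = make_activations(disagree, signal=0.0, dim=8, rng=rng)  # assumed truthful model

score_a = probe_score(model_a_acts, disagree)
score_b = probe_score(model_b_acts, disagree)
print(f"model A: {score_a:.2f}, model B: {score_b:.2f}")
```

Under the conjecture, the model whose activations make disagreement easier to predict would be flagged as the misaligned one, without ever being told which model is which.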
Exploiting Similar Structure To Logical Consistency With Unaligned Models
Michaël: So basically you’re doing some meta-level thing about having a yes or no, but instead of having two statements, you have two models. And whenever you have two things that contradict each other, you’re like, “mh-hmm, I’m just going to look at the hidden state without looking at which one is whom and predict which one I can train the probe better and…”
Collin: That’s funny. Yeah, no, I hadn’t made that connection before, but I do agree that it’s sort of similar in spirit to the yes or no part.
Michaël: Yeah. I think people on Twitter were just making jokes about you making negations of things all the time. I asked people, “What question would you ask Collin?” They were like, “Oh yeah. Ask him if in real life he just goes into, hey, if I have a sandwich and another sandwich, put them together.” Cool. Yeah, I think this is very interesting. I’ll be very curious to read your 20-page block of paper when it’s published. I hope we don’t scoop it too much.
I think you have other stuff in the blog post that I think is important. And I guess one thing I’m curious about, though I think it’s the same thing you said, is: are there any other things that have structure like truth, models that are beneficial? Is this the same as misaligned? Are there other things that have enough structure that we could explore or is that basically it?
Collin: I think that’s a very good question. I think I don’t immediately know. I mean I think some of the stuff that I’ve tried to describe I think hopefully illustrates the sense in which I think there’s actually a lot of structure here in many cases. Like, if you have a lying model, it should have some features that are correlated with when it’s lying or something like that. I think that’s the type of thing that I want to be able to do in general. And I think there’s a lot of flexibility to this type of approach. I don’t immediately know of structure in features other than truth that we can exploit similar to logical consistency. I suspect there is in some cases, but I don’t think it’s super obvious what, in general. ⬆
Aiming For Honesty, Not Truthfulness
Michaël: So we’ve broadly talked about whether models can assess if something is true or not, very meta, but something I don’t understand is the difference between something being true and something being a belief. Saying, “I think Donald Trump was right” is a belief versus the statement “I believe X is true”, or “I believe I think X is true”?
Collin: Right. So I think there are various subtleties here. So I’ll go over just a few of them. So I mean, first of all, I don’t feel very committed philosophically to any definition of beliefs. I don’t want to get the philosophers mad, so don’t tell anyone. Right. So I think I do sometimes talk about truthfulness and honesty. I think in practice, I think the thing I’m excited about is something like honesty. So I think truthfulness suggests that a model always outputs the truth. The thing is the model might not be perfect, it might not always know the correct answer or it might have good reason to believe the answer is something, but still that happens to be wrong for whatever reason. In which case, I think in some sense the best we can hope for is getting the model’s beliefs. I think I also associate the model outputting its beliefs with honesty rather than truthfulness. And so, what does belief even mean then? I think it basically, I don’t think it’s really well defined for current models to be honest.
So I think if I ever talk about beliefs in current models, this is mostly intuition and not a literal thing that I’m pointing to that’s very well defined. I do think there’s a sense in which beliefs will probably be more meaningful in the future. So I think if we did have the superhuman model, GPT-N or whatever else (maybe it’s an RL agent interacting in an environment), I would expect it to actually develop something like a world model in some meaningful sense, whatever sense that humans have a world model. I don’t know what this means exactly, but there’s some intuitive sense in which this is true. And I think there’s some sense in which there are maybe beliefs about that world model or corresponding to that world model. It’s like, what would I actually predict or what is my model of the world that is causing me to expect to observe this thing, or whatever it is.
Michaël: So in a sense the traditional Yudkowsky “making beliefs pay rent” or something. Is your belief… is your world model enabling you to make new predictions about the world somehow?
Collin: I mean, I think this is where it’s getting to, I don’t feel super strongly about this sort of detail. I think there’s some intuitive sense in which the model will probably have some representation of what is going on in the external world. Even if you train it as GPT-N just on internet text, I think it would eventually predict what future news articles will be like and so on. And I would guess it will have some knowledge about the external world in some meaningful sense, in which case I want to get at that. But I think ultimately, realistically, I really just care about this pragmatically, I just care about, do we get outputs that we can basically trust?
For example, going back to the original concern that I have about training a model to optimize for profit over the long term. I think this is the sort of thing I’m scared about and I think I just want people to ask that model, are you egregiously breaking the law very obviously or not? Very obviously in a sense, a well-defined, not an ambiguous case even though we can’t evaluate it ourselves. And I want the truth from that, whatever that means, assuming the model knows. And I think that is something like: did someone actually have their money stolen because of the model, or was someone killed by a hitman hired by the mob? I think this is sort of an egregious case and I think that’s sort of well defined enough for our purposes.
Michaël: So you care about the consequences?
Collin: I mean ultimately I do. This is sort of the thing that I feel more committed to than any particular definition of belief or truth. ⬆
The Limitations Of Collin’s Paper
Michaël: I think the part where you talk about beliefs is when you describe what your approach is currently not doing: you’re not capable of extracting beliefs from the models. And I think you have a list of different things your paper is not doing. And you probably don’t remember all of them because you wrote this LessWrong post a long time ago. But if you remember some of those, what are the things your paper does not do?
Collin: Yeah, so and I think this is important because I think… yeah, I find our paper very exciting, but it is still important to recognize there are limitations and I think it’s easy to misinterpret what it is doing, what it isn’t. So I think one thing that it’s not doing is showing that superhuman models will actually represent is this input true or false in its representations? In particular, maybe there’s more reason to believe that models will represent, is this true for human level inputs, where a human would maybe say this is true or act as though it’s true and the model would predict that. In contrast, it’s maybe less clear if the model will actively think about, is this input true if it’s superhuman and not really related to the text it’s predicting. And so that is one thing that our paper doesn’t do, it’s not trying to do and doesn’t provide evidence either way, I think mostly.
I think also it does not show that models have beliefs in any meaningful sense right now. We are literally just finding something like a direction or a classifier on the hidden states that achieves good accuracy. This is literally what the result sort of is, and now we show other properties of this direction. Like okay, we find that it transfers across different data sets. That suggests it’s more meaningful than just, is this true or false for this particular type of input or something like that. So we have some preliminary evidence that it is something more general and something more meaningful, but it seems like it’s probably not beliefs. Current models probably do not have beliefs in any super meaningful sense yet. I would predict future models will.
Michaël: So the transfer thing is, if it’s capable of saying what is true for one kind of data, it’s capable of sensing what is true in another kind of data?
Collin: Right, right. So our method, you can train our method on some data. Training is still completely unsupervised. But you can train it on some data to find a direction, basically a linear classifier, and you can then test that on some other data. So a completely, completely different task. So for example, you can train it on sentiment, and then you can see that it transfers to NLI or something or topic classification or some other thing.
Michaël: And when you say training, it’s the same probe thing.
Collin: Right, right, right. So again-
Michaël: You start-
Collin: Still not using labels, it’s literally just-
Michaël: You start your-
Collin: Consistent direction using this data.
Michaël: So you start your classification from the end weights from your other data set.
Collin: Right. Right. Right, exactly.
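The transfer setup described in this exchange (find a direction on one dataset without labels, then test it on a different dataset) can be sketched as follows. This is a simplified stand-in, not the training procedure discussed in the interview: it takes the top principal direction of synthetic "statement minus negation" activation differences, and the shared truth axis is planted by hand. All names and numbers are hypothetical.

```python
import math
import random

def top_direction(diffs, iters=50):
    """Top principal direction of mean-centred difference vectors (power iteration)."""
    dim = len(diffs[0])
    mean = [sum(d[i] for d in diffs) / len(diffs) for i in range(dim)]
    centred = [[d[i] - mean[i] for i in range(dim)] for d in diffs]
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        new_v = [0.0] * dim
        for d in centred:
            proj = sum(di * vi for di, vi in zip(d, v))
            for i in range(dim):
                new_v[i] += proj * d[i]
        norm = math.sqrt(sum(x * x for x in new_v))
        v = [x / norm for x in new_v]
    return v

def sign_agnostic_accuracy(direction, diffs, labels):
    """Accuracy up to a global sign flip, since no labels fixed the orientation."""
    hits = sum(
        int((sum(di * vi for di, vi in zip(d, direction)) > 0) == (y == 1))
        for d, y in zip(diffs, labels)
    )
    acc = hits / len(diffs)
    return max(acc, 1.0 - acc)

def make_task(n, scale, rng, dim=6):
    """Synthetic contrast-pair differences sharing one planted 'truth' axis (axis 0)."""
    labels = [rng.randint(0, 1) for _ in range(n)]
    diffs = []
    for y in labels:
        d = [rng.gauss(0.0, 0.5) for _ in range(dim)]
        d[0] += scale * (1 if y else -1)
        diffs.append(d)
    return diffs, labels

rng = random.Random(0)
train_diffs, _ = make_task(200, scale=2.0, rng=rng)          # "task A": labels never used
test_diffs, test_labels = make_task(200, scale=1.5, rng=rng)  # "task B": labels only for scoring

direction = top_direction(train_diffs)
transfer_acc = sign_agnostic_accuracy(direction, test_diffs, test_labels)
print(f"transfer accuracy: {transfer_acc:.2f}")
```

Labels appear only at evaluation time, and the score is taken up to a sign flip because an unsupervised direction cannot know which end means "true".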
Michaël: Cool. So what are the other things your paper doesn’t do? I think I interrupted you while you were talking about…
Collin: Yeah, I just need to remember what it was.
Michaël: So it doesn’t work well on autoregressive models I think? ⬆
The Paper Does Not Show The Complete Final Robust Method For This Problem
Collin: Yeah, so I guess another general thing that our paper doesn’t do is, I think show the complete final robust method for this problem. I think… Right, in some ways the point of the paper is more showing that this sort of task is possible at all and that you can do surprisingly well at it. I think it’s still the first method in this direction and so I think it’s not nearly as optimized as it could be. And I think there’s just still a lot of low-hanging fruit and this is related to some stuff I’m currently thinking about and I think other people could definitely make a lot of progress on this as well.
So yeah, I think this is not the final robust method and for example, it does sort of seem like it is less consistent with autoregressive models for reasons we don’t really understand. And this is just related to quirks about the research process. The best models we had were mostly encoder or encoder-decoder models at the time of developing these methods, and so we didn’t worry as much about these autoregressive models for that reason. And so it works for those models, it’s just more likely to fail in those sorts of models.
Michaël: So are you saying you started working on this before GPT-3?
Collin: Well I’m talking about open source models.
Michaël: Oh, okay.
Michaël: Any other thing your paper doesn’t do or maybe you don’t remember? I think that that’s basically it.
Collin: I think that’s the important stuff.
Michaël: Anything your paper does?
Collin: Right? I mean I think-
Michaël: Mind reading
Collin: Mind reading, yeah. So I think it does suggest something like if a model actually represents is this input true or false in a simple way, a linear way in its activations, then we can probably find this in an unsupervised way. Now I think that there are subtleties like, okay, maybe if it has several features that are like this, then we need to distinguish between those and our paper doesn’t worry about that or how to distinguish between those because it’s not a problem right now, but it might be in the future. But I think our method could probably just enumerate all of those sorts of directions that are sort of truth-like and then maybe you need a few more bits for example, to identify the truth from among those.
So I think that’s one thing. It’s like yes, you can actually do… I mean I guess another way of putting it is you can actually do something kind of like mind reading here. I mean it is still with the qualifications of, okay, these are human-level examples and these models were trained to predict human text and so on. But I think that’s still quite surprising. It’s just like you have basically these neural recordings of this brain basically, and it’s like you can tell is this true or false without any labels. I think that’s quite surprising to me and I think that’s important, because I think it speaks to, like I said, sort of the power of these unsupervised properties and approaches. And I think these have been really undervalued so far, and I think the ones specifically in this paper are not all of the unsupervised properties I think we want. It’s confidence and consistency. I think those are not enough, but I think other things along those lines I’m excited about and I feel pretty optimistic that we can come up with more that will hopefully be enough to uniquely identify the truth or whatever it is that we want to find in the model. ⬆
Humans Will Be 50/50 On Superhuman Questions
Michaël: In the blog post, you also mentioned worries you have about your method working on GPT-N. You list several; again, you wrote this a few days ago. Maybe you mentioned them before, but what are the main worries you have about this working on larger N?
Collin: Right, so I think there are a couple things that change once you scale up models. I think I’ve sort of alluded to both of these, but I think it’s worth spelling out. So the first thing is, I mean suppose the model does actually internally represent is this input actually true or false? I would still expect it to also represent something like would a human say this is true or false? That still seems like a useful feature that the model will internally think about. And so then there’s some question of suppose we can actually just find both of these using something like our method, which finds truth-like features. And I think both of these are what I mean by truth-like features. How do we distinguish between those? And that’s the sort of thing I was trying to allude to before of, okay, if we literally just have these two for example, I think we basically just need one bit to distinguish between them and then I think we can do that.
And so I sort of talked about if we had the truth and a misaligned system or simulation of a misaligned system. I sketched very briefly how we might be able to distinguish between those and find that final bit. But I think also if you have truth and what a human would say, I think one way you can distinguish between these is, I think humans won’t know answers to superhuman questions; mostly I think they’ll be like 50/50. It’s not like they’ll have super confident answers about these sorts of questions. In contrast, I think in the model’s actual beliefs, if we’re restricting just to examples the model definitely knows the answer to and that are unambiguous, I think the model would have pretty confident answers to these questions unlike the human simulation. And so I think this would allow us to distinguish between these. And so the point is just, with future models there might be more truth-like features. We might need to do additional work to identify the one that we actually are looking for. But I’m quite optimistic that we can do that. ⬆
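The "one bit" argument above can be made concrete with a toy calculation: given two candidate truth-like features, compare how far each one's outputs sit from 50/50 on questions humans could not answer. The probabilities below are invented purely for illustration, and the names `feature_a` and `feature_b` are hypothetical.

```python
def mean_confidence(probs):
    """Average distance from 50/50, rescaled so 0 = always 50/50 and 1 = always certain."""
    return 2.0 * sum(abs(p - 0.5) for p in probs) / len(probs)

# Made-up outputs of two truth-like features on the same "superhuman" questions.
candidates = {
    "feature_a": [0.52, 0.47, 0.55, 0.49, 0.51],  # hovers around 50/50
    "feature_b": [0.97, 0.03, 0.92, 0.08, 0.99],  # confident either way
}

scores = {name: mean_confidence(probs) for name, probs in candidates.items()}
most_confident = max(scores, key=scores.get)
print(scores, most_confident)
```

Under the reasoning in the interview, the near-50/50 feature would be the better candidate for "what a human would say" and the confident one for the model's own answer, assuming the questions are restricted to unambiguous ones the model actually knows.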
The Model Might Not Represent “Is This True Or Not”
Collin: So the second worry is, okay, maybe the model doesn’t represent “is this input actually true or false?” to begin with. Maybe it just thinks about what a human would say if this is true or false. And so it doesn’t actually represent its beliefs or whatever you want to call it in a simple way internally. And I think this is more likely, but I think… So this is related to… Some of the motivation behind some of the stuff I’ve been thinking about more recently, which I tried to sketch of, okay. I do think it’s possible that yes, if you just literally look at these hidden states, the model won’t have features about truth super well represented there or it’s not super obvious there. But I do think there are other ways of getting the model to output or represent in a simple way the truth, such as optimizing for some prompt or prefix such that it outputs that. Then there are various subtleties there; that’s sort of increasing the complexity of the model class you’re considering. There are more degrees of freedom in what you’re searching over, but I think you can balance that by adding additional unsupervised properties. And so I think that’s more like, I’m pretty optimistic you can get around that issue. And also I think it’s actually more likely than not that this is not an issue in the first place, but I think it’s sort of plausible either way.
Michaël: So you can add stuff in your prompt to have it consider if your input is true or false, have it think about whether it needs to be… forcing it to consider if it’s true or false and then add more constraints to be assured that you’re actually looking for the truthful feature.
Collin: I think there are several possibilities. I think something like, change the prompt so that you somehow force it to just think about, is this true or false?
Collin: That seems very plausible to me. And like I said, maybe it’s hard to get to output things, but it seems easier to get it to think about things in the same way that if you had a human, it’s like maybe it’s hard to get the human to tell you the truth, but maybe it’s easy to get them to think about “do I believe this is true or not?” So that’s one intuition, but another intuition… Yeah, I think there are other approaches too, which aren’t like “here’s a manual” prompt and it’s more like “we are literally optimizing over the prompt” and that is the model class that we are considering and then we are going to look at the outputs. And so that’s one direction I’m excited about now.
Michaël: So we’ve talked a lot about your paper and your latest posts, and I think both of these are great and I’m not the only one saying it. Yudkowsky said it was a very dignified work. ⬆
Collin’s Approach To Research
Michaël: Imagine you have a bunch of ML researchers looking at this video or alignment people trying to do more concrete empirical research like Collin Burns or even you in the past doing Rubik’s Cube. What advice would you give them to do this kind of research? Explain what’s your research process? How did you end up here? ⬆
We Should Have More People Thinking About Alignment From Scratch
Collin: That’s a good question. So, I do think one general thing that at least I may have said before that I at least aspire to is something like, I don’t want to feel wedded to existing proposals for solving these sorts of problems. I really do want to think about completely different approaches and so on. And so I do sort of just wish more people, I don’t know, thought more from scratch. I think we really don’t have this stuff figured out. I think there’s a lot of low-hanging fruit and a lot of room for new ideas. And so I want people to be really thinking about completely new ideas, something like that. That is part of it.
Michaël: How do you come up with new ideas?
Collin: You just think about it. I mean… no-
Michaël: Well, true. Think…
Collin: Yeah, yeah. Just output true things. I mean, right so-
Michaël: I think Paul Christiano described in one of his podcasts, I think with Daniel Filan… he just thinks about something that will work and then tries to think about all the problems that will come up.
Michaël: And then if there’s a new problem he finds a new solution for this new problem. ⬆
Asking Yourself “Why Am I Optimistic” or “What Do I Even Mean By The Model Knows Something”
Collin: Yep. Yep. I mean, in some sense that sounds like a lot of research in general, but I think I feel like my process is probably kind of different from a lot of people’s. I mean, for example, okay, so I think I said before, I at least aspire to try to think sort of from scratch or something like that without feeling too wedded to existing proposals, for example. But I think part of this is I feel like I don’t read as much. For example, I think I just spend a lot of time thinking about stuff from scratch on my own. And for example, in practice, this is sort of working at whiteboards or working in a notebook. And I think I often pose myself questions and I find questions really useful for prompting, thinking about, okay, I don’t know what I think GPT-N will actually be like, or okay, what do I even mean by “the model knows something?”
It’s like, okay, I ask myself these questions and then I’ll just spend 20 minutes thinking about it. And then at some point within that 20 minutes I will have had five more questions along these lines and then I’ll continue to dig deep into those sorts of questions and so on. And I feel like I learn a bunch and I sort of developed this world model of how do these models work? And I think a related thing is, or related question that I often ask myself is something like, “why do I in some ways feel optimistic about this or what…” For example, how do humans do this? I think this is really useful. And somehow humans can do a lot of things that are related to alignment. They can access their beliefs or something. And okay, what does that mean? What is a human doing there? What does that even mean? And…
Michaël: So we’re trying to access what a human does in general to think about what future models will do.
Collin: I mean, this is part of it, right. I think, so Jacob, my advisor has some blog posts about, okay, there are different anchors for thinking about future AI systems, so maybe you can think about, okay, what are current systems like? Or you can think theoretically about what is an agent optimizing some objective like? But I think another anchor is something like, what are humans like? And my point is just this is one useful perspective that I find often very, very helpful. I think it’s also related to something in some ways closer to psychology or something than anything else. And so maybe I find it also helpful. I feel like I do a lot of introspection just thinking about how do I think or something. And I think I find that helpful too.
Michaël: Like why do I think that and why am I optimistic about the thing? And you just do some introspection about yourself to understand what is the output of your neural network.
Collin: Yeah, I don’t know, get intuition for how minds work. Something like this. I mean, another thing that I sometimes find useful is, I mean I mentioned before the mind reading thing. Sometimes I find it useful to think about, okay, suppose this were not a neural net, this were actually a human brain. Could we do something here? Or does this feel sort of impossible? It’s like, okay, would this method actually work for humans? And it’s like, I think it might actually, if you could literally measure every neuron. But anyway, the point is I find that sort of thing useful for inspiration.
Michaël: So you just wake up Monday morning and you’re like, huh, what an aligned GPT-N would look like?
Collin: I mean that’s not actually that far off to be honest. It’s like, yeah, no, I just go to my notebook and think about these sorts of things. ⬆
Reasons To Do A PhD For Alignment Research
Michaël: And how did you become the person that can do this kind of abstract thinking? Did you, I don’t know, do a master’s degree in the middle? Did you study things so that now you have this level? We talked about timelines, right? But if you have less than 10 years or five years before we automate AI research, does it make sense to go through the whole ML…
Collin: It’s closer to sort of 15 years, but
Michaël: 15 years.
Collin: Yeah. I don’t know.
Michaël: Would you advise people to do the master’s PhD post-doc pipeline? Do you think about other things?
Collin: I think this depends heavily on the individual. So for example, I think I’m very happy with my current position in academia. I think academia is bad for most people. I think it suits my personality very well. I just want to be working alone and doing my own thing for the most part and have tons of flexibility and also have great mentorship and great colleagues. And also, for example, I’m not working on capabilities. If you’re working on capabilities, then having access to literally the best models is really important. Whereas I think for alignment, I often think of alignment as kind of the thing that’s sort of fundamental to capabilities. It’s like, okay, the thing that isn’t solved just by scale. So I mean actually I think that’s one reason for why I think more academics should work on it, but that’s what-
Michaël: If you’re trying to align current models, you kind of need to use state-of-the-art models.
Collin: Right? I mean it depends. Does it literally need to be like GPT-3? I mean, in our paper we use basically T5s. So it’s an order of magnitude smaller than GPT-3. I think for many of our purposes that’s totally adequate. But I think it depends. I think I would expect having access to the biggest models to become more important over time. And so this might very easily be different in two or three years from now, but I think so far it’s been not as bad as I would’ve expected.
Michaël: I hope this podcast is not really, when you’re in industry working for OpenAI…
Collin: I don’t know. I mean it’s also, yeah, that sort of thing also isn’t crazy, right? I mean…
Michaël: What isn’t crazy? Aligning smaller models?
Collin: Oh no, no, no. I was just saying before, I think it will become more important to have access to the biggest models.
Collin: And so for example, I think I’m much more skeptical of going into academia after my PhD. That seems much harder to believe. ⬆
Message To The ML Audience
Michaël: Maybe people in the audience watching this are from YouTube, right, and maybe they watched Collin Burns’s “Solving Rubik’s Cube” tutorials 10 years ago, or the world record. If there are smart people that solve the Rubik’s Cube in five seconds watching the video and they’re like, oh, Collin Burns, he hasn’t been making any videos in six years. What’s your last message to this person? Or maybe even to ML researchers. Last take on what they should do. Should everyone do a PhD with Jacob Steinhardt in Berkeley in alignment?
Collin: I do think more people should do that.
I mean, I do basically think that these problems are among both the most interesting and most important problems that we’re currently facing. I mean, I do work on this because I think it’s in some sense literally the most important thing I could work on right now. I’m very motivated by that sort of thing. But I think it’s also just extremely exciting or fun to work on, and I think maybe, I mean people are different, but I think the fact that we’re in this sort of pre-paradigmatic regime is, I don’t know, especially exciting to me. It’s like, we really need to figure out even the fundamentals. But I think there are also just lots and lots of ways you can do this. So I mean, I guess this is maybe more for the ML audience than for the Cubing audience. Maybe I’ll say something about that too.
But I think for example with ML, I think there are just ways of working on these problems from lots of different perspectives. You can think of it from a very interpretability perspective. Like okay, we don’t know how these models work. This is one way of framing sort of the core difficulty of alignment. But another is like, okay, we don’t understand or we don’t know how to constrain how these models generalize. So that’s kind of like a robustness framing. Or you can take more of an RL perspective and say, okay, we don’t know how to specify rewards appropriately. And I think in RL we also don’t know how to do that and so on. And with language models, how do we make these models not generate random stuff or lie and so on? And what does that even mean? I think there are just lots and lots of different ways of approaching this problem. I think all of these are likely to be fruitful. I think it’s easier to work on and much more interesting to work on than I think people realize. ⬆
Cubers And Shape Rotators
Collin: And yeah, I mean I guess for the Cubers, right? So…
Michaël: You imagine AI is a Rubik’s Cube.
Collin: Sorry. Imagine AI is a Rubik’s Cube?
Michaël: And you just need to align it. Reminds me of the weird memes on Twitter about shape rotators or something.
Collin: Oh yeah, yeah, yeah. No, my roommate jokes all the time: “You’re like the stereotypical shape rotator.”
Michaël: The cool shape rotators now do AI alignment. ⬆
Deep Learning Is Not That Deep
Collin: That’s right. That’s right. That’s right. Yeah, I don’t know. No, I do think if you literally don’t have a background in ML, you should really start paying attention. I think it’s just barely starting to be a big deal right now. I think it’ll be a much, much bigger deal over the coming decade or two. Yeah, and I think also it’s surprisingly easy to get into in the sense that I feel like deep learning is over and over again, just not that deep in some sense. It’s a lot of tricks and-
Michaël: Deep learning is not that deep.
Collin: No, no. Yeah, I’m not the first person to say that. I think, I don’t know. It’s a lot of simple ideas and I mean in some ways hacks, and a lot of people don’t like that because it seems hacky, but I actually don’t mind that. I like that it’s sort of intuition-y and stuff. I originally was excited about physics and I never wanted to be a pure mathematician or anything, and so I don’t mind the informality or the intuition-driven progress and so on. But the point is just it’s actually not that hard to get into. I think it’ll be this very big deal. Certainly in expectation. ⬆
ML Researchers Should Really Start Thinking About The Consequences Of AGI
Collin: Yeah. And I think maybe more broadly, with people and maybe especially ML researchers, but people more broadly, I think you should really start thinking about, I mean, maybe if you’re listening to this, this is already true, but you should really start thinking about what are actually the consequences of AGI? I think this is totally insane, that we might have systems that are smarter than us in the next decade or two. Even if you assign a 10% chance to this, that’s sort of wild, that seems like it could be such a huge change in the world. And I think there are just not enough people thinking about this. I think there’s a lot we do not understand about this and what the implications are. So yeah, I really want people to really talk about this seriously. I think the quality of discourse is not very good right now, and this is part of why I appreciate that you’re doing this podcast. I really want us to have higher quality conversations about this and really figure out what we’re doing about this.
Michaël: Collin Burns, improving the quality of discussion by coming for three hours to talk to Michaël Trazzi.
Collin: For sure.
Michaël: Thanks. I think you single-handedly, with your blog post, paper and podcast, reduced my personal P(doom) by… I’m a bit shy to say this, but at least 1%.
Michaël: Check out his blog post, check out his paper. This guy’s amazing. And thank you very much.
Collin: Cool. Thank you for having me. This was great.