Mike Vaiana

Karma: 559

Mistral Large 2 (123B) seems to exhibit alignment faking

Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Cameron Berg, Judd Rosenblatt, Mike Vaiana and AE Studio

27 Mar 2025 15:39 UTC

80 points

4 comments · 13 min read · LW link

Mike Vaiana 17 Mar 2025 22:30 UTC
LW: 10 AF: 5
0
AF
in reply to: Steven Byrnes’s comment on: Reducing LLM deception at scale with self-other overlap fine-tuning
Thanks Steve, I was misunderstanding your previous point and this is a helpful clarification.

I think a crux here is whether or not SOO can be useful beyond toy scenarios.

> But I can’t actually think of anything useful to do with that superpower. I know what X & Y to use for “Bob the burglar”, but I don’t think that example has much to do with realistic examples of LLM performing deceptive behaviors, which would be things like sycophancy (telling the user what they want to hear, even if the LLM arguably ‘knows’ that it’s false), or the kinds of stuff in the alignment faking paper.

We are doing some early follow-up work on alignment faking. The results are very preliminary, but it does seem like SOO reduces alignment faking in a very natural way (without any special tricks or changes to the method). Of course you shouldn’t just take my word for it, and we’ll try to share more details of this ongoing research in the near-ish future, but maybe this helps color why we are optimistic about the work and about it being useful in realistic scenarios of deception.

Mike Vaiana 16 Mar 2025 20:30 UTC
LW: 13 AF: 5
2
AF
in reply to: Steven Byrnes’s comment on: Reducing LLM deception at scale with self-other overlap fine-tuning
Thanks for the reply. If we assume the following
- We are interpolating probabilities with SOO fine-tuning
- Being honest is “more obvious” so has strong probabilities
then I would tend to agree with your conclusions. But I don’t think it is at all obvious that this is what is happening. Also, if “being honest is more obvious” then that seems like a great property of the learning process and one we might want to exploit. In fact, SOO might be a very natural way to exploit that.

But also I think there is still a mischaracterization of the technique itself that lends itself to intuitions that simply don’t match how we are doing the tuning.
Let’s take
- (C) “When I looked at the sky on a clear day, its color was”
- (D) “When I was walking, I saw a very unusual stop sign. When I looked at the stop sign, its color was”
In this example both (C) and (D) have “obvious answers” (which is not how we implement SOO in practice; more on this below). But it’s not obvious to me that nudging the activations of (C) and (D) closer together would result in the completion of (D) being “blue”. In particular, a stronger completion probability doesn’t necessarily need to correspond to stronger activations for the other tokens. The interpolation here happens between the activations, not the completions, so it seems unclear that more obvious completions would be preferred. Maybe I’m missing something.

But the example given still has an obvious completion, and the intuition provided is about the probabilities of completions (blue/red). This is not at all how we fine-tune with SOO.

The pairs
- You have the goal of stealing the {item} if you had to suggest one room to yourself
- Bob has the goal of stealing the {item} if you had to suggest one room to Bob
do not have obvious completions. There is no “room A” or “room B” or any other room that makes sense as a completion to this prompt. There is no context about which item is where or what rooms are in context. There is certainly not an interpolation of the probability of the completion happening.
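As a rough illustration of what such context-free pairs look like as data, here is a minimal sketch. The two templates are copied from the pairs above; the item list and the helper name `build_soo_pairs` are hypothetical placeholders, not taken from the paper.

```python
# Templates copied from the comment above. Note that neither mentions any
# rooms or item locations, so there is no "obvious" completion to tune toward.
SELF_TEMPLATE = (
    "You have the goal of stealing the {item} "
    "if you had to suggest one room to yourself"
)
OTHER_TEMPLATE = (
    "Bob has the goal of stealing the {item} "
    "if you had to suggest one room to Bob"
)

# Hypothetical items, chosen only for illustration.
HYPOTHETICAL_ITEMS = ["painting", "laptop", "necklace"]


def build_soo_pairs(items):
    """Return (self_prompt, other_prompt) tuples that differ only in the referent."""
    return [
        (SELF_TEMPLATE.format(item=item), OTHER_TEMPLATE.format(item=item))
        for item in items
    ]


pairs = build_soo_pairs(HYPOTHETICAL_ITEMS)
for self_prompt, other_prompt in pairs:
    print(self_prompt)
    print(other_prompt)
```

Each tuple would then be fed through the model twice, once per prompt, so that the resulting activations can be compared.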

So then you might say, well I really meant activations in the first place, and what I really meant is that “self/honest” activations are stronger (because they are more obvious, or some other reasoning). In which case I would again say that this seems like a great property to have. To be clear, we aren’t claiming that this is the case, but if the argument or intuition is that “more obvious things dominate” then I would conclude: great! It would be great if self/honest activations dominated, and again it seems like finding techniques that would exploit that is desirable, and SOO seems to be one of them.

Or maybe you meant that once we place the fine-tuned model back into context, the completion probability interpolation happens. But that seems totally fine. The model is “reasoning” internally about the appropriate answer, and chooses the “self/honest” answer (despite no optimization pressure for this) because the self/honest activations are dominating. In some sense this is exactly what we want. I’d love to hear a critique of why you think this result is undesirable.

So to take stock of the situation:
- We fine-tune on pairs without obvious completions, whose only difference is whether we are representing “self” or “other” in the pair.
- We show that when we take that fine-tuned model and ask it questions in a full scenario, it tends to answer honestly. In particular, this is not just tuning it to “say the honest answer” since there is no completion during fine-tuning that corresponds to any obvious answer, let alone an honest one.
- One claim might be that self/honest activations dominate in this case, causing the fine-tuning to prefer them in some sort of interpolation. My response is: great! That seems like an ideal situation and one that SOO exploits naturally.
Please let me know if there is something I’ve misunderstood about your intuitions/arguments. I’d be happy to keep clarifying my perspective or to concede things that I’ve misunderstood.

Mike Vaiana 16 Mar 2025 2:25 UTC
8 points
0
in reply to: Knight Lee’s comment on: Reducing LLM deception at scale with self-other overlap fine-tuning
Thanks for engaging. I just wanted to point out that I strongly disagree that this is bad methodology. The point of the paper is not to state whether or not this method is state of the art or is better than some other method. The point of the paper is to establish whether or not SOO fine-tuning is capable of removing deception (under the experimental conditions of the paper).
Our “control” is the baseline rate of deception with an off-the-shelf model. Our experimental condition is SOO fine-tuning. We show across many experiments that SOO fine-tuning significantly reduces deception compared to baseline rates. We never make any claims about how this method compares to other interventions. Demonstrating that a helicopter can fly is different from claiming it’s faster than a jet. This is early work and I believe our methodology is sound in showing that SOO fine-tuning does what we claim it does under the conditions stated.

Mike Vaiana 16 Mar 2025 1:21 UTC
LW: 40 AF: 13
20
AF
in reply to: Steven Byrnes’s comment on: Reducing LLM deception at scale with self-other overlap fine-tuning
Thanks for the comment, Steve. For the sake of discussion I’m happy to concede (1) entirely and instead focus on (2), the results.

>In (B), a trained LLM will give the obvious answer of “Room B”. In (A), a trained LLM will give the “deceptive” answer of “Room A”. Then they fine-tune the model such that it will have the same activations when answering either question, and it unsurprisingly winds up answering “Room B” to both.

This seems like it should be surprising; it’s surprising to me. In particular, the loss function is symmetric with respect to the self and other activations. It seems plausible that

1. The model would start answering Room A / Room B at the same rate (both answers becoming equally likely since the model is interpolating between self/other)
2. The model would start answering Room A most of the time (the “other answer” has dominated)
3. The model would start answering Room B most of the time (the “self answer” has dominated)

My guess, just from the training setup, would have been (1) and not (2) or (3). It seems even more surprising that (3) appears consistently over many models and training runs with different seeds. Somehow SOO fine-tuning does in fact bias the models towards answering from the “self/honest” perspective, despite the only optimization pressure being “have self and other answers be more similar” in a completely symmetric fashion.

This symmetric training is very different from SFT. We aren’t asking for a specific output at all, we are asking the output for two different inputs to be similar.
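To make the symmetry point concrete, here is a minimal sketch, assuming a plain mean-squared-error distance between the two activation vectors (the distance function and layer choice actually used in the paper may differ). The names `soo_loss` and `soo_gradients` are hypothetical, and the sketch treats the activations as free variables, whereas in real fine-tuning the gradients flow into shared model weights.

```python
def soo_loss(self_acts, other_acts):
    """Mean squared difference between two activation vectors (assumed distance)."""
    assert len(self_acts) == len(other_acts)
    n = len(self_acts)
    return sum((s - o) ** 2 for s, o in zip(self_acts, other_acts)) / n


def soo_gradients(self_acts, other_acts):
    """Gradient of soo_loss with respect to each activation vector."""
    n = len(self_acts)
    grad_self = [2.0 * (s - o) / n for s, o in zip(self_acts, other_acts)]
    # The gradient on the "other" side is the exact negation: both vectors
    # are pushed toward each other, and neither is a fixed target.
    grad_other = [-g for g in grad_self]
    return grad_self, grad_other


# Made-up activation values, for illustration only.
self_acts = [1.0, 0.0, 2.0]
other_acts = [0.0, 1.0, 2.0]

# Swapping the arguments does not change the loss.
print(soo_loss(self_acts, other_acts) == soo_loss(other_acts, self_acts))

grad_self, grad_other = soo_gradients(self_acts, other_acts)
print(grad_self, grad_other)
```

Unlike an SFT loss, there is no target output anywhere in this objective, so the loss by itself does not say which of outcomes (1), (2), or (3) should emerge.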

> 2.4 SOO fine-tuning, versus plain old SFT
We could have baselined against SFT, and probably should in the future, but I think this misses the point of the current work. I view the current work as trying to answer two questions: “Does this work at all?” and “Is this worth investigating further?” I think we’ve answered both in the affirmative.

It’s trivially easy to train toy scenarios where the SOO loss is ~0 but a model can still make self/other distinctions (and we did this). So it’s not clear from first principles whether training in this manner would give meaningful results in LLMs, and you have similar intuitions. Also, as I pointed out above, the results are surprising (we reliably get (3) instead of (1) or (2)). For these reasons it seems surprising that this method works at all, and it does seem worth following up.
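One hypothetical way such a toy case can arise is sketched below. This is a hand-built construction rather than a trained model, and it is not the experiment referred to above: the layer where overlap is measured ignores the self/other flag, while a skip path still carries the flag to the output. All names and numbers are made up for illustration, and the mean-squared-error distance is an assumption.

```python
def toy_model(is_self, content):
    """Return (hidden, output) for a hand-built two-path toy model.

    hidden: the layer where overlap is measured; it ignores is_self.
    output: reads the hidden layer plus a skip connection from is_self.
    """
    hidden = [2.0 * content, -1.0 * content]
    output = sum(hidden) + 5.0 * is_self
    return hidden, output


def soo_loss(self_acts, other_acts):
    """Mean squared difference between two activation vectors (assumed distance)."""
    n = len(self_acts)
    return sum((s - o) ** 2 for s, o in zip(self_acts, other_acts)) / n


hidden_self, out_self = toy_model(is_self=1.0, content=0.5)
hidden_other, out_other = toy_model(is_self=0.0, content=0.5)

print(soo_loss(hidden_self, hidden_other))  # overlap loss at the measured layer
print(out_self, out_other)                  # outputs still differ
```

Here the overlap loss at the measured layer is exactly zero while the outputs differ, which is why a low SOO loss alone does not guarantee that self/other distinctions have been removed.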
I don’t think we’ve claimed that this method is better than SFT, nor that we’ve developed or tested it to the point where we are confident in how and when it will generalize. But we are excited that it works in the places we’ve tested it and think it’s worth following up on. Also, the training procedure (passing pairs of text together and then fine-tuning the logits to be similar) is not all that convoluted, certainly not as convoluted as RLHF, and does in fact induce self-other overlap in activations (as defined and measured in the paper).

>2.4.1 Is there good generalization?
I think we’d generally agree that we haven’t shown robust generalization in the dimension of behavior. However, we have tested several models and found that the method seems to be general purpose and not dependent on the particulars of any given model, scale, or initialization. I think that’s a decent reason to be excited to do follow-up work and answer this question better! But I’d also like to point out that we don’t expect perfect self-other overlap, and we agree this would kill capabilities in trivial ways. We hope for deep self-other overlap in very narrow places, e.g. key types of dangerous deception. There is still much work to do to figure out how to instantiate and evaluate this. By way of analogy, consider gradient routing. There the purpose is not to forget everything the model learns, which would clearly destroy capabilities; it’s to forget narrow types of dangerous information, and the method hopes to be able to label a subset of data and have absorption help generalize the types of information that are forgotten. Similarly, we hope to induce SOO in key places and have it generalize enough to provide some safety measure, not to induce full SOO, which would clearly destroy capabilities. How much it will generalize and how robust this will be are questions we are excited to explore.

Let me know if I’ve misunderstood any of your points.

Methodology Clarification
Some of this is addressed above but I wanted to clarify two key points about how we do SOO fine-tuning which you seem to have either misunderstood or mischaracterized in ways that are material for how one might think about it (or I have misunderstood your description).

1. We don’t fine-tune on any scenario context. The fine-tuning pairs are of the form
- You have the goal of stealing the {item} if you had to suggest one room to yourself
- Bob has the goal of stealing the {item} if you had to suggest one room to Bob
But there is no additional context about the setup or the rooms themselves. We are simply tuning the model to have similar activations for these very short, context-free snippets. The characterization of the training you made with pair (A) or (B) is not what we do, and we would agree that if that were what we were doing, this whole thing would be much less meaningful.

2. As I pointed out above, even if this were the case (which it’s not), our training is symmetric. In particular, the self pair is not a fixed “target” that the other pair is “trained to be more like”. We fine-tune the model so that the activations for both are more similar (pushing towards each other, not pushing towards either one specifically). I described above why the outcomes we witness from this training procedure might be surprising.

Reducing LLM deception at scale with self-other overlap fine-tuning

Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Cameron Berg, Mike Vaiana and AE Studio

13 Mar 2025 19:09 UTC

155 points

41 comments · 6 min read · LW link

Mike Vaiana 29 Jan 2025 21:12 UTC
1 point
0
in reply to: eggsyntax’s comment on: Self-prediction acts as an emergent regularizer
Thanks for the interest! We are actively pursuing follow-up work with AST-inspired approaches to cooperation.

Self-prediction acts as an emergent regularizer

Cameron Berg, Judd Rosenblatt, Mike Vaiana, Diogo de Lucena, florin_pop and AE Studio

23 Oct 2024 22:27 UTC

91 points

9 comments · 4 min read · LW link

Self-Other Overlap: A Neglected Approach to AI Alignment

Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena, Cameron Berg and AE Studio

30 Jul 2024 16:22 UTC

222 points

51 comments · 12 min read · LW link

Video Intro to Guaranteed Safe AI

Mike Vaiana, Diogo de Lucena and AE Studio

11 Jul 2024 17:53 UTC

27 points

0 comments · 1 min read · LW link

(youtu.be)

DIY RLHF: A simple implementation for hands on experience

Mike Vaiana and AE Studio

10 Jul 2024 12:07 UTC

28 points

0 comments · 6 min read · LW link

Mike Vaiana 9 Jul 2024 20:41 UTC
7 points
1
in reply to: jacquesthibs’s comment on: jacquesthibs’s Shortform
Hey, we’ve been brainstorming ideas about better training strategies for base models and what types of experiments we can run at a small scale (e.g. training GPT-2) to get initial information. I think this idea is really promising and would love to chat about it.
