This thought might be detectable. Now the problem of scaling safety becomes a problem of detecting [...] this kind of conditional, deceptive reasoning.
What do you do when you detect this reasoning? This feels like the part where all plans I ever encounter fail.
Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that’s unlikely to get rid of it. I think at this point the natural answer is “yes, your systems are scheming against you, so you gotta stop, because when you train against it, you are probably primarily making it a better schemer”.
I would be very surprised if you have a 3-month Eliezer that is not doing scheming the first time, and training your signals away is much easier than actually training away the scheming.
Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that’s unlikely to get rid of it.
I’m not necessarily asserting that this solves the problem, but it seems important to note that the obviously-superior alternative to training against it is validating against it: i.e., when you observe scheming, you train a new model, ideally with different techniques that you reckon have their own chance of working.
However doomed you think training against the signal is, you should think validating against it is significantly less doomed, unless there’s some reason why well-established machine learning principles don’t apply here. Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.
EDIT: later in the thread you say that this “is in some sense approximately the only and central core of the alignment problem”. I’m wondering whether thinking about this validation-vs-training point might cause you a nontrivial update then?
Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.
Validation is certainly less efficient at overfitting but it seems a bit like using an evolutionary algorithm rather than gradient descent. You aren’t directly optimizing according to the local gradient, but that doesn’t necessarily mean you’ll avoid Goodharting—just that you’re less likely to immediately fall into a bad local optimum.
The likelihood of preventing Goodharting feels like it depends heavily on assumptions about the search space. The “validation” filters the search space to areas where scheming isn’t easily detectable, but what portion of this space is safe (and how can we tell)? We don’t actually have a true globally accurate validator oracle—just a weak approximation of one.
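A toy illustration of this worry, with entirely made-up numbers: selecting candidate runs on a weak detector does reduce true scheming somewhat, but it also selects hard for whatever schemes least detectably.

```python
import random

random.seed(0)

def sample_candidate():
    # One hypothetical training run: a true scheming propensity (unobservable) and
    # how visible that scheming is to our detection tools.
    scheming = random.random()
    detectability = random.random()
    detected = scheming * detectability   # what the weak validator actually measures
    return scheming, detectability, detected

candidates = [sample_candidate() for _ in range(10_000)]

# "Validating against it": keep only runs whose *detected* scheming is below a threshold.
kept = [c for c in candidates if c[2] < 0.05]

def avg(xs):
    return sum(xs) / len(xs)

print("true scheming (all vs kept):", avg([c[0] for c in candidates]), avg([c[0] for c in kept]))
print("detectability (all vs kept):", avg([c[1] for c in candidates]), avg([c[1] for c in kept]))
```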
I roughly agree, but it seems very robustly established in practice that the training-validation distinction is better than just having a training objective, even though your argument mostly applies just as well to the standard ML setup.
You point out an important difference, which is that our ‘validation metrics’ might be quite weak compared to most cases, but I still think it’s clearly much better to use some things for validation than for training.
Like, I think there are things that are easy to train away but hard/slow to validate away (just like when training an image classifier you could in principle memorise the validation set, but it would take a ridiculous amount of hyperparameter optimisation).
One example might be if we have interp methods that measure correlates of scheming. These are incredibly easy to train away; they are still possible to validate away, but probably enough harder that the ratio of non-schemers you end up with is higher than if you had trained against them, which wouldn’t affect that ratio at all.
A separate argument is that I think if you just do random search over training ideas, rejecting them if they don’t get a certain validation score, you actually don’t Goodhart at all. Might put that argument in a top level post.
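A minimal sketch of that selection procedure (every function here is a hypothetical stand-in for a real training pipeline and a held-out scheming evaluation): the validation signal is only ever used to accept or reject whole training recipes, never as a gradient.

```python
import random

def sample_training_recipe():
    # Hypothetical knobs over "training ideas", not model weights.
    return {"lr": 10 ** random.uniform(-5, -3), "honesty_data_frac": random.uniform(0.0, 0.5)}

def train_model(recipe):
    return {"recipe": recipe}          # placeholder for an actual training run

def held_out_scheming_score(model):
    return random.random()             # placeholder for interp probes, honeypots, etc.

THRESHOLD = 0.05
accepted = None
for attempt in range(100):
    recipe = sample_training_recipe()                # random search over ideas
    model = train_model(recipe)
    if held_out_scheming_score(model) < THRESHOLD:   # validation only gates accept/reject
        accepted = (attempt, recipe)
        break

print("accepted:", accepted)
```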
I’d be interested in seeing this argument laid out.
I wrote it out as a post here.
I think that if you agree “3-month Eliezer is scheming the first time” is the main problem, then that’s all I was trying to justify in the comment above.
I don’t know how hard it is to train 3-month Eliezer not to scheme, but here is a general methodology one might pursue to approach this problem.
The methodology looks like “do science to figure out when alignment faking happens and does not happen, and work your way toward training recipes that don’t produce alignment faking.”
For example, you could use detection tool A to gain evidence about whether training against detection tool B actually removes alignment faking.
To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don’t game them so hard).
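To make that kind of experiment concrete, here is a deliberately pessimistic toy version (stub functions and invented numbers, not any real detector): training against detector B mostly hides the signal B sees, and holding detector A out lets you notice that.

```python
def detector_a(model):   # independent check, held out from training
    return model["faking"] * model["visibility_to_a"]

def detector_b(model):   # the detector we optimize against
    return model["faking"] * model["visibility_to_b"]

def train_against_detector_b(model):
    # Pessimistic stub for "finetune until detector B stops firing": it mostly reduces
    # visibility to B and only somewhat reduces the faking itself.
    return {**model,
            "visibility_to_b": model["visibility_to_b"] * 0.1,
            "faking": model["faking"] * 0.8}

base = {"faking": 1.0, "visibility_to_a": 0.9, "visibility_to_b": 0.9}
trained = train_against_detector_b(base)

print("detector B before/after:", detector_b(base), detector_b(trained))  # looks fixed
print("detector A before/after:", detector_a(base), detector_a(trained))  # largely isn't
```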
You might also find other ways to avoid alignment faking that don’t rely on training against detection tools. For example, perhaps if you train models to be highly corrigible and chill from the start, they tend not to fake alignment -- I don’t know.
I’m uncertain about how likely alignment faking will be by default, or how hard it is to mitigate. I think these are messy empirical questions.
However, I’m more confident that there are principled empirical methodologies we can apply to make progress on these questions. I don’t think my explanation here justifies this claim -- but I’m hoping to publish posts that go into more detail soon.
To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don’t game them so hard).
How do you iterate? You mostly won’t know whether you just trained away your signal, or actually made progress. The inability to iterate is kind of the whole central difficulty of this problem.
(To be clear, I do think there are some methods of iteration, but it’s a very tricky kind of iteration where you need constant paranoia about whether you are fooling yourself, and that makes it very different from other kinds of scientific iteration)
Agreed that it’s tricky iteration and requires paranoia and careful thinking about what might be going on.
I’ll share you on the post I’m writing about this before I publish. I’d guess this discussion will be more productive when I’ve finished it (there are some questions I want to think through regarding this and my framings aren’t very crisp yet).
Seems good!
FWIW, at least in my mind this is in some sense approximately the only and central core of the alignment problem, and so having it left unaddressed feels confusing. It feels a bit like making a post about how to make a nuclear reactor where you happen to not say anything about how to prevent the uranium from going critical, but you did spend a lot of words on how to make the cooling towers, the color of the bikeshed next door, and how to translate the hot steam into energy.
Like, it’s fine, and I think it’s not crazy to think there are other hard parts, but it felt quite confusing to me.
I’m sympathetic to this reaction.
I just don’t actually think many people agree that it’s the core of the problem, so I figured it was worth establishing this (and I think there are some other supplementary approaches, like automated control and incentives, that are worth throwing into the mix) before digging into the ‘how do we avoid alignment faking’ question.
I agree that this seems like a core alignment problem. The problem you are describing seems like a rephrasing of the ELK problem.
Here is a quote from the post describing what to do in that situation:
These automatic control measures only raise the alarm. What happens after the alarm is raised? The developer might also implement autonomous incident response:
Resampling: Force defective M_1 agents to take different actions.
AI agent elimination: Delete suspicious agents in favor of other (hopefully decorrelated) unsuspicious ones.
Fine-tuning: Train agents not to take suspicious actions again.
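A rough sketch of how those three responses might sit behind an alarm (the policy, thresholds, and names are invented purely for illustration, not taken from the post):

```python
from enum import Enum, auto

class Response(Enum):
    RESAMPLE = auto()   # force the agent to take a different action
    ELIMINATE = auto()  # replace with a (hopefully decorrelated) agent
    FINETUNE = auto()   # train the agent not to take suspicious actions again

def incident_response(alarm_score: float, repeat_offender: bool) -> Response:
    # Invented policy: cheap interventions for borderline alarms, stronger ones as
    # evidence accumulates that a particular agent keeps tripping the monitors.
    if alarm_score < 0.5:
        return Response.RESAMPLE
    if repeat_offender:
        return Response.ELIMINATE
    return Response.FINETUNE

print(incident_response(0.3, repeat_offender=False))  # Response.RESAMPLE
print(incident_response(0.8, repeat_offender=True))   # Response.ELIMINATE
```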
I’m curious about how fine-tuning away scheming thoughts could still lead to scheming. Here are two possible outcomes:
Scenario 1: Fine-tuning actually robustly reduces scheming behavior (as intended), maybe because tokens and actions associated with scheming can be reliably detected and these tokens are required for scheming in the CoT.
Scenario 2: Fine-tuning reduces the appearance of scheming but doesn’t really reduce scheming: in this case, the fine-tuning reduces the probability of certain words or trajectories, but the underlying ‘concept’ of scheming remains and the model merely rephrases its thinking in a way that is not detected as scheming and uses different tokens.
For scenario 2, maybe you need something like activation engineering to change the underlying thought process rather than just token-level descriptions of it.
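A minimal sketch of what an activation-level intervention could look like (toy linear layer, and the ‘scheming direction’ here is random noise standing in for one you would fit from contrastive examples):

```python
import torch
import torch.nn as nn

hidden = 16
layer = nn.Linear(hidden, hidden)        # stand-in for one block of the model
direction = torch.randn(hidden)          # hypothetical "scheming" direction
direction = direction / direction.norm()

def project_out(module, inputs, output):
    # Remove the component of the activation along the scheming direction,
    # rather than penalising particular tokens downstream of it.
    coeff = output @ direction           # (batch,) projection coefficients
    return output - coeff.unsqueeze(-1) * direction

handle = layer.register_forward_hook(project_out)
y = layer(torch.randn(4, hidden))
print((y @ direction).abs().max().item())   # ~0: the direction has been ablated
handle.remove()
```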
At this point I would direct the “deferred task” apparatus fully towards interventional interpretability. Put a moratorium on further gradient-based training, which is not well understood and can have many indirect effects, unless you have some understanding of modularity and have applied stop gradients almost everywhere that is irrelevant to the generator of the conditional, deceptive reasoning behavior. Instead, halt, melt, and catch fire at that point.
Halt further model deployments towards the original deferred task. Quarantine the model that first exhibited the reasoning and promote it to a model organism (never deployed further, only studied by other earlier models). Direct the apparatus (from earlier checkpoints before you reached this stage) towards surgically reducing it while preserving mechanisms relevant to the original deferred task.
In the frame of enumerative safety, this would look something like identifying which circuits and features are minimal to the original deferred task (though this hinges upon task-minimal circuits being deception-free), evaluating the identified regions for mechanism preservation, localization, and minimality, and run-time ablating the rest.
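In that frame, the run-time ablation step might look something like this sketch (sizes and feature indices invented; the encoder/decoder stand in for whatever feature dictionary the interpretability work produced):

```python
import torch
import torch.nn as nn

n_features = 64
task_minimal = torch.tensor([0, 3, 7, 12, 40])   # hypothetical task-relevant features

encoder = nn.Linear(32, n_features)              # stand-in for a learned feature dictionary
decoder = nn.Linear(n_features, 32)

mask = torch.zeros(n_features)
mask[task_minimal] = 1.0

def run_with_ablation(x):
    feats = torch.relu(encoder(x))
    feats = feats * mask                         # zero everything outside the whitelist
    return decoder(feats)

print(run_with_ablation(torch.randn(2, 32)).shape)   # torch.Size([2, 32])
```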
There is probably a much better frame, but this is a new start to the “induction base case” in a completely separate invocation of the original apparatus. There are other permutations of spinning up new deferred tasks, such as earlier models applying Microscope AI on the quarantined model organism to gain insight on how we got to this point, without any direct attempt to remediate the issue, which as you note is “unlikely to get rid of it” despite one naive attempt above.
If your spinoff deferred tasks lead to the procurement of provable statements or heuristic arguments that the quarantined model cannot be minimized without “avoiding this kind of conditional, deceptive reasoning,” abandon the original deferred task and switch to a different plan. Otherwise, restart the apparatus towards the original deferred task when you have the proof artifacts.
There are a lot of concerns you could raise with this additional structure but it seems like a distinct problem that requires a separate rebuttal rather than a hard stop fail? The obvious one is that these sorts of spinoff deferred tasks could be harder than the original task and consistently lead to the same failure mode, a la “exception thrown while handling previous exception.”