To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don’t end up gaming them so hard).
How do you iterate? You mostly won’t know whether you just trained away your signal, or actually made progress. The inability to iterate is kind of the whole central difficulty of this problem.
(To be clear, I do think there are some methods of iteration, but it’s a very tricky kind of iteration where you need constant paranoia about whether you are fooling yourself, and that makes it very different from other kinds of scientific iteration)
Agree that it’s tricky iteration and requires careful thinking about what might be going on, plus constant paranoia.
I’ll share the post I’m writing about this with you before I publish. I’d guess this discussion will be more productive once I’ve finished it (there are some questions I want to think through regarding this, and my framings aren’t very crisp yet).
Seems good!
FWIW, at least in my mind this is in some sense approximately the only and central core of the alignment problem, and so having it left unaddressed feels confusing. It feels a bit like making a post about how to build a nuclear reactor where you happen to not say anything about how to keep the uranium from going critical, but you did spend a lot of words on how to build the cooling towers, the color of the bikeshed next door, and how to convert the hot steam into energy.
Like, it’s fine, and I think it’s not crazy to think there are other hard parts, but it felt quite confusing to me.
I’m sympathetic to this reaction.
I just don’t actually think many people agree that it’s the core of the problem, so I figured it was worth establishing this (and I think there are some other supplementary approaches, like automated control and incentives, that are worth throwing into the mix) before digging into the ‘how do we avoid alignment faking’ question.
I agree that this seems like a core alignment problem. The problem you are describing seems like a rephrasing of the ELK problem.