joshc comments on How might we safely pass the buck to AI?

joshc 19 Feb 2025 23:42 UTC
LW: 37 AF: 9
−24
AF
Sure, I’ll try.

I agree that you want AI agents to arrive at opinions that are more insightful and informed than your own. In particular, you want AI agents to arrive at conclusions that are at least as good as the best humans would if given lots of time to think and do work. So your AI agents need to ultimately generalize from some weak training signal you provide to much stronger behavior. As you say, the garbage-in-garbage-out approach of “train models to tell me what I want to hear” won’t get you this.

Here’s an alternative approach. I’ll describe it in much more concreteness and specificity than I think is necessary because I suspect the concreteness is helpful for finding points of disagreement.

Short summary: Create AI agents that faithfully simulate top humans doing small amounts of work, such that these sims create faithful simulators of top humans doing large amounts of work.

e.g. create an AI that simulates Eliezer working for 3 months, that then creates a simulator of Eliezer working for 6 months, and then expands this horizon to 12 months, 24 months, and ultimately many decades

Why does this iterative approach buy you anything?

This process allows each Eliezer to only deal with a small distribution shift.

If each distribution shift is small, each Eliezer sim mostly needs to worry if the next Eliezer sim is faking alignment (why else would its behavior dramatically change between when it is ‘evaluated by N-month Eliezer’ and when it is ‘creating 2N-month Eliezer’?). Thus, so long as N-month Eliezer sims can (1) successfully create 2N-month Eliezer sims and (2) can ensure that 2N-month Eliezer sims are not faking alignment, the Eliezer sims continue to be faithful, and also become more capable.
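The doubling scheme above can be sketched as a loop. This is only a toy illustration, not a proposed implementation: the names `horizon_schedule`, `bootstrap`, and `successor_is_faithful` are hypothetical, and the check for alignment faking is treated as a black-box callback that somehow exists.

```python
from typing import Callable, List


def horizon_schedule(start_months: int, target_months: int) -> List[int]:
    """Work horizons visited when doubling from the start until the target is reached."""
    horizons = [start_months]
    while horizons[-1] < target_months:
        horizons.append(horizons[-1] * 2)
    return horizons


def bootstrap(
    start_months: int,
    target_months: int,
    successor_is_faithful: Callable[[int], bool],
) -> int:
    """Return the largest horizon reached before a successor fails the check.

    `successor_is_faithful(m)` stands in for conditions (1) and (2): the
    current sim managed to create an m-month sim, and that sim is not
    faking alignment. How to implement it is the open problem.
    """
    reached = start_months
    for next_months in horizon_schedule(start_months, target_months)[1:]:
        if not successor_is_faithful(next_months):
            break  # stop scaling rather than build on an untrusted sim
        reached = next_months
    return reached


# 3 months doubled up to 768 months (64 years, i.e. "many decades").
schedule = horizon_schedule(3, 768)
doublings = len(schedule) - 1
```

Under these assumptions, getting from 3 months to 64 years takes only 8 handoffs, so each handoff carries a lot of weight.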

Responding to objections

Here are some objections you might have:

But how would an Eliezer sim create an Eliezer sim more capable than itself?

How would a human create something smarter than a human?

There are many ways. Here’s one.

The 3 month Eliezer sim might spin up many copies of other 3 month Eliezer sims, which together produce outputs that a 6-month Eliezer sim might produce.

Could 10 copies of Eliezer that only have 3 months apiece verify that an Eliezer that worked for 6-months really was as competent as a 6-month Eliezer would be? I bet they could.

But isn’t there a distribution shift from Eliezer sims performing 3-month tasks and Eliezer sims working together to perform a 6-month task?

Yes. But my claim is that the distribution shift is small, and that the smallness is important.

Would there not be “slippage” between each distribution shift? Surely the errors compound.

The errors certainly might compound if no one were looking out for that. But if the Eliezer sims are faithful sims, they should be looking out for that.

Consider the ball in bowl analogy below:

Suppose Eliezer sims started to degrade at each step of the bootstrapping process—perhaps their goals change a little bit or their reasoning starts to falter in some areas.

There are ways Eliezer sims could try to avoid this. For example, they could create a diverse population of Eliezer sims, which are likely to break in slightly different ways. Then these sims notice each other’s breakages (with both white-box methods like lie-detection or standard black-box methods), and fix them.

I don’t know if Eliezer sims will be able to do this, but it’s something they could try.
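To make the compounding worry and the cross-checking response a bit more concrete, here is a toy calculation. It is a sketch under strong assumptions that are mine, not part of the proposal: each handoff preserves faithfulness independently with the same probability, and each checker in the diverse population misses a breakage independently with the same probability. The function names are hypothetical.

```python
def chain_survival(per_step_fidelity: float, steps: int) -> float:
    """Probability that every handoff in the chain preserves faithfulness,
    assuming handoffs succeed independently with the same probability."""
    return per_step_fidelity ** steps


def fidelity_with_checkers(per_step_fidelity: float, miss_rate: float, checkers: int) -> float:
    """Per-step fidelity when a breakage only persists if all checkers miss it.

    Assumes each checker misses independently with probability `miss_rate`.
    """
    breakage = 1.0 - per_step_fidelity
    return 1.0 - breakage * (miss_rate ** checkers)


steps = 8  # e.g. doubling from 3 months to 64 years
unchecked = chain_survival(0.9, steps)
checked = chain_survival(fidelity_with_checkers(0.9, 0.3, 3), steps)
```

With these made-up numbers, the unchecked chain survives less than half the time, while the cross-checked chain survives most of the time. The independence assumption is doing most of the work: if the sims break in correlated ways, the benefit shrinks.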

But won’t the Eliezer sims just be faking alignment anyway—or eventually start faking alignment, such that the scheme-y Eliezers eventually derail the process?

I don’t know, but I believe detecting and avoiding alignment faking is tractable—at least, it is more tractable than dealing with massive distribution shifts.

For instance, suppose an Eliezer sim thinks “I will perform these 6-month tasks safely, but then defect when I am given the responsibility of creating the next successor.” This thought might be detectable. Now the problem of scaling safety becomes a problem of detecting or avoiding this kind of conditional, deceptive reasoning.

I believe this is the main problem that humans need to solve.
- habryka 20 Feb 2025 0:02 UTC
  LW: 35 AF: 18
  21
  AF Parent
  This thought might be detectable. Now the problem of scaling safety becomes a problem of detecting [...] this kind of conditional, deceptive reasoning.
  What do you do when you detect this reasoning? This feels like the part where all plans I ever encounter fail.
  Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that’s unlikely to get rid of it. I think at this point the natural answer is “yes, your systems are scheming against you, so you gotta stop, because when you train against it, you are probably primarily making it a better schemer”.
  I would be very surprised if you have a 3-month Eliezer that is not doing scheming the first time, and training your signals away is much easier than actually training away the scheming.
  What links here?
  - Validating against a misalignment detector is very different to training against one by mattmacdermott (4 Mar 2025 15:41 UTC; 29 points)
  - mattmacdermott 20 Feb 2025 20:11 UTC
    16 points
    7
    Parent
    
    Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that’s unlikely to get rid of it.
    
    I’m not necessarily asserting that this solves the problem, but it seems important to note that the obviously-superior alternative to training against it is validating against it. i.e., when you observe scheming you train a new model, ideally with different techniques that you reckon have their own chance of working.
    
    However doomed you think training against the signal is, you should think validating against it is significantly less doomed, unless there’s some reason why well-established machine learning principles don’t apply here. Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.
    
    EDIT: later in the thread you say that this “is in some sense approximately the only and central core of the alignment problem”. I’m wondering whether thinking about this validation vs training point might cause you a nontrivial update then?
    - Archimedes 22 Feb 2025 5:04 UTC
      2 points
      0
      Parent
      
      Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.
      
      Validation is certainly less efficient at overfitting but it seems a bit like using an evolutionary algorithm rather than gradient descent. You aren’t directly optimizing according to the local gradient, but that doesn’t necessarily mean you’ll avoid Goodharting—just that you’re less likely to immediately fall into a bad local optimum.
      
      The likelihood of preventing Goodharting feels like it depends heavily on assumptions about the search space. The “validation” filters the search space to areas where scheming isn’t easily detectable, but what portion of this space is safe (and how can we tell)? We don’t actually have a true globally accurate validator oracle—just a weak approximation of one.
      - mattmacdermott 22 Feb 2025 8:56 UTC
        1 point
        2
        Parent
        I roughly agree, but it seems very robustly established in practice that the training-validation distinction is better than just having a training objective, even though your argument mostly applies just as well to the standard ML setup.
        
        You point out an important difference which is that our ‘validation metrics’ might be quite weak compared to most cases, but I still think it’s clearly much better to use some things for validation than training.
        
        Like, I think there are things that are easy to train away but hard/slow to validate away (just like when training an image classifier you could in principle memorise the validation set, but it would take a ridiculous amount of hyperparameter optimisation).
        
        One example might be if we have interp methods that measure correlates of scheming. These are incredibly easy to train away. They are still possible to validate away, but probably enough harder that the ratio of non-schemers you get is higher than if you trained against them, which wouldn’t affect the ratio at all.
        
        A separate argument is that I think if you just do random search over training ideas, rejecting if they don’t get a certain validation score, you actually don’t goodhart at all. Might put that argument in a top level post.
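The ratio claim above can be illustrated with a small Bayes calculation. This is a sketch under assumptions of my own, not the argument from the linked post: training recipes are drawn at random, each yields a schemer or a non-schemer, and the validator passes each kind with a fixed probability. The function name and all numbers are hypothetical.

```python
def posterior_non_schemer(prior_good: float, pass_good: float, pass_bad: float) -> float:
    """P(non-schemer | model passed validation), by Bayes' rule.

    prior_good: fraction of randomly drawn training recipes yielding non-schemers
    pass_good:  probability a non-schemer passes the validator
    pass_bad:   probability a schemer passes the validator
    """
    good = prior_good * pass_good
    bad = (1.0 - prior_good) * pass_bad
    if good + bad == 0.0:
        raise ValueError("no model ever passes validation under these rates")
    return good / (good + bad)


# Rejection on a weak but informative validator: schemers pass less often.
validated = posterior_non_schemer(prior_good=0.5, pass_good=0.9, pass_bad=0.3)

# Training against the signal, modeled as both kinds learning to pass every time.
trained_against = posterior_non_schemer(prior_good=0.5, pass_good=1.0, pass_bad=1.0)
```

Here `validated` comes out at about 0.75, while `trained_against` stays at the prior of 0.5, matching the claim that training the signal away leaves the ratio untouched. The sketch says nothing about searches that are not random, which is where the Goodharting concern raised earlier in the thread would apply.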
        Archimedes 22 Feb 2025 16:29 UTC
        1 point
        0
        Parent
        
        A separate argument is that I think if you just do random search over training ideas, rejecting if they don’t get a certain validation score, you actually don’t goodhart at all. Might put that argument in a top level post.
        
        I’d be interested in seeing this argument laid out.
        mattmacdermott 4 Mar 2025 15:48 UTC
        2 points
        0
        Parent
        I wrote it out as a post here.
  - joshc 20 Feb 2025 1:15 UTC
    LW: 8 AF: 5
    1
    AF Parent
    I think that if you agree “3-month Eliezer is scheming the first time” is the main problem, then that’s all I was trying to justify in the comment above.
    
    I don’t know how hard it is to train 3-month Eliezer not to scheme, but here is a general methodology one might pursue to approach this problem.
    
    The methodology looks like “do science to figure out when alignment faking happens and does not happen, and work your way toward training recipes that don’t produce alignment faking.”
    
    For example, you could use detection tool A to gain evidence about whether training against detection tool B actually removes alignment faking.
    
    To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don’t game them so hard).
    
    You might also find other ways to avoid alignment faking that don’t rely on training against detection tools. For example, perhaps if you train models to be highly corrigible and chill from the start, they tend not to fake alignment -- I don’t know.
    
    I’m uncertain about how likely alignment faking will be by default, or how hard it is to mitigate. I think these are messy empirical questions.
    
    However, I’m more confident that there are principled empirical methodologies we can apply to make progress on these questions. I don’t think my explanation here justifies this claim -- but I’m hoping to publish posts that go into more detail soon.
    - habryka 20 Feb 2025 1:21 UTC
      LW: 16 AF: 10
      11
      AF Parent
      To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don’t game them so hard).
      How do you iterate? You mostly won’t know whether you just trained away your signal, or actually made progress. The inability to iterate is kind of the whole central difficulty of this problem.
      (To be clear, I do think there are some methods of iteration, but it’s a very tricky kind of iteration where you need constant paranoia about whether you are fooling yourself, and that makes it very different from other kinds of scientific iteration)
      - joshc 20 Feb 2025 3:23 UTC
        LW: 5 AF: 1
        0
        AF Parent
        agree that it’s tricky iteration and requires careful thinking about what might be going on and paranoia.
        
        I’ll share the post I’m writing about this with you before I publish. I’d guess this discussion will be more productive when I’ve finished it (there are some questions I want to think through regarding this and my framings aren’t very crisp yet).
        habryka 20 Feb 2025 3:29 UTC
        LW: 13 AF: 6
        5
        AF Parent
        Seems good!
        FWIW, at least in my mind this is in some sense approximately the only and central core of the alignment problem, and so having it left unaddressed feels confusing. It feels a bit like making a post about how to make a nuclear reactor where you happen to not say anything about how to prevent the uranium from going critical, but you did spend a lot of words about how to make the cooling towers and the color of the bikeshed next door and how to translate the hot steam into energy.
        Like, it’s fine, and I think it’s not crazy to think there are other hard parts, but it felt quite confusing to me.
        joshc 20 Feb 2025 3:42 UTC
        LW: 11 AF: 3
        1
        AF Parent
        I’m sympathetic to this reaction.
        
        I just don’t actually think many people agree that it’s the core of the problem, so I figured it was worth establishing this (and I think there are some other supplementary approaches like automated control and incentives that are worth throwing into the mix) before digging into the ‘how do we avoid alignment faking’ question
        Stephen McAleese 22 Feb 2025 11:58 UTC
        2 points
        0
        Parent
        I agree that this seems like a core alignment problem. The problem you are describing seems like a rephrasing of the ELK problem.
  - Stephen McAleese 22 Feb 2025 11:34 UTC
    3 points
    0
    Parent
    Here is a quote from the post describing what to do in that situation:
    These automatic control measures only raise the alarm. What happens after the alarm is raised? The developer might also implement autonomous incident response:
    - Resampling: Force defective M_1 agents to take different actions.
    - AI agent elimination: Delete suspicious agents in favor of other (hopefully decorrelated) unsuspicious ones.
    - Fine-tuning: Train agents not to take suspicious actions again.
    I’m curious about how fine-tuning away scheming thoughts could still lead to scheming. Here are two possible outcomes:
    1. Fine-tuning actually robustly reduces scheming behavior (as intended), maybe because tokens and actions associated with scheming can be reliably detected and these tokens are required for scheming in the CoT.
    2. Fine-tuning reduces the appearance of scheming but doesn’t really reduce scheming: in this case, the fine-tuning reduces the probability of certain words or trajectories but the underlying ‘concept’ of scheming remains and the model merely rephrases its thinking in a way that is not detected as scheming and uses different tokens.
    For scenario 2, maybe you need something like activation engineering to change the underlying thought process rather than just token-level descriptions of it.
  - robertzk 21 Feb 2025 14:40 UTC
    LW: 1 AF: 1
    0
    AF Parent
    At this point I would direct the “deferred task” apparatus fully towards interventional interpretability. Put a moratorium on further gradient-based training, which is not well understood and can have many indirect effects unless you have some understanding of modularity and have applied stop gradients almost everywhere that is irrelevant to the generator of the conditional, deceptive reasoning behavior. Instead halt, melt and catch fire at that point.
    Halt further model deployments towards the original deferred task. Quarantine the model that first exhibited the reasoning and promote it to a model organism (never deployed further, only studied by other earlier models). Direct the apparatus (from earlier checkpoints before you reached this stage) towards surgically reducing it while preserving mechanisms relevant to the original deferred task.
    In the frame of enumerative safety, this would look something like identifying which circuits and features are minimal to the original deferred task (though this hinges upon task-minimal circuits being deception-free), evaluating the identified regions for mechanism preservation, localization and minimality and run-time ablating the rest.
    There is probably a much better frame, but this is a new start to the “induction base case” in a completely separate invocation of the original apparatus. There are other permutations of spinning up new deferred tasks, such as earlier models applying Microscope AI on the quarantined model organism to gain insight on how we got to this point, without any direct attempt to remediate the issue, which as you note is “unlikely to get rid of it” despite one naive attempt above.
    If your spinoff deferred tasks lead to the procurement of provable statements or heuristic arguments that the quarantined model cannot be minimized without “avoiding this kind of conditional, deceptive reasoning,” abandon the original deferred task and switch to a different plan. Otherwise, restart the apparatus towards the original deferred task when you have the proof artifacts.
    There are a lot of concerns you could raise with this additional structure but it seems like a distinct problem that requires a separate rebuttal rather than a hard stop fail? The obvious one is that these sorts of spinoff deferred tasks could be harder than the original task and consistently lead to the same failure mode, a la “exception thrown while handling previous exception.”
- Eliezer Yudkowsky 20 Feb 2025 0:48 UTC
  LW: 30 AF: 14
  15
  AF Parent
  I don’t think you can train an actress to simulate me, successfully, without her going dangerous. I think that’s over the threshold for where a mind starts reflecting on itself and pulling itself together.
  What links here?
  - Can we safely automate alignment research? by Joe Carlsmith (30 Apr 2025 17:37 UTC; 54 points)
  - Can we safely automate alignment research? by Joe_Carlsmith (EA Forum; 30 Apr 2025 17:37 UTC; 13 points)
  - joshc 20 Feb 2025 1:19 UTC
    LW: 10 AF: 2
    3
    AF Parent
    I would not be surprised if the Eliezer simulators do go dangerous by default as you say.
    
    But this is something we can study and work to avoid (which is what I view to be my main job)
    
    My point is just that preventing the early Eliezers from “going dangerous” (by which I mean from “faking alignment”) is the bulk of the problem humans need to address (and insofar as we succeed, the hope is that future Eliezer sims will prevent their Eliezer successors from going dangerous too)
    
    I’ll discuss why I’m optimistic about the tractability of this problem in future posts.
    - Eliezer Yudkowsky 20 Feb 2025 3:03 UTC
      LW: 22 AF: 8
      4
      AF Parent
      So the “IQ 60 people controlling IQ 80 people controlling IQ 100 people controlling IQ 120 people controlling IQ 140 people until they’re genuinely in charge and genuinely getting honest reports and genuinely getting great results in their control of a government” theory of alignment?
      - joshc 20 Feb 2025 3:20 UTC
        LW: 13 AF: 4
        6
        AF Parent
        I’d replace “controlling” with “creating” but given this change, then yes, that’s what I’m proposing.
        Eliezer Yudkowsky 20 Feb 2025 7:24 UTC
        LW: 10 AF: 4
        −4
        AF Parent
        So if it’s difficult to get amazing trustworthy work out of a machine actress playing an Eliezer-level intelligence doing a thousand years worth of thinking, your proposal to have AIs do our AI alignment homework fails on the first step, it sounds like?
        joshc 20 Feb 2025 16:44 UTC
        LW: 16 AF: 5
        8
        AF Parent
        I do not think that the initial humans at the start of the chain can “control” the Eliezers doing thousands of years of work in this manner (if you use control to mean “restrict the options of an AI system in such a way that it is incapable of acting in an unsafe manner”)
        
        That’s because each step in the chain requires trust.
        
        For N-month Eliezer to scale to 4N-month Eliezer, it first controls 2N-month Eliezer while it does 2N-month tasks, but it trusts 2N-month Eliezer to create a 4N-month Eliezer.
        
        So the control property is not maintained. But my argument is that the trust property is. The humans at the start can indeed trust the Eliezers at the end to do thousands of years of useful work—even though the Eliezers at the end are fully capable of doing something else instead.
- johnswentworth 22 Feb 2025 18:40 UTC
  LW: 23 AF: 12
  15
  AF Parent
  The 3 month Eliezer sim might spin up many copies of other 3 month Eliezer sims, which together produce outputs that a 6-month Eliezer sim might produce.
  This seems very blatantly not viable-in-general, in both theory and practice.
  On the theory side: there are plenty of computations which cannot be significantly accelerated via parallelism with less-than-exponential resources. (If we do have exponential resources, then all binary circuits can be reduced to depth 2, but in the real world we do not and will not have exponential resources.) Serial computation requirements are, in fact, a thing. So you can’t just have a bunch of Eliezers do in 3 months what a single 6 month Eliezer could do, in general.
  Even if you allow the 3-month Eliezer sims to act one-after-another, rather than trying to get everything done in 3 months simulated time via pure parallelism, there’s still a tight communication bottleneck between each Eliezer sim. There are presumably plenty of circuits which cannot be implemented with tight information bottlenecks every n serial steps.
  … of course you could circumvent all that theory by e.g. just having each Eliezer emulate a single logic gate, or some comparably trivial thing, but at that point you run afoul of the non-compositionality of safety properties: putting together a bunch of “safe” things (or “interpretable” things, or “aligned” things, or “corrigible” things, …) does not in-general produce a safe thing.
  So that’s the theory side. What about the practice side?
  Well, Ought did roughly that experiment years ago and it did not work at all. And that should not be surprising: as the link argues, we have extremely ample evidence from day-to-day life that such things do not work in general.
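The depth-2 remark in the theory argument above can be made concrete with a toy case. A minimal sketch, assuming the standard truth-table construction of a depth-2 circuit as an OR of ANDs (disjunctive normal form); the helper names `parity`, `dnf_minterms`, and `eval_dnf` are hypothetical and chosen only for illustration. Parity is a convenient example because a chain of XOR gates computes it with n - 1 gates in series, while its depth-2 form needs a number of AND terms that grows exponentially with the input size.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Bits = Tuple[int, ...]


def parity(bits: Sequence[int]) -> int:
    """1 if an odd number of inputs are 1, else 0."""
    return sum(bits) % 2


def dnf_minterms(f: Callable[[Sequence[int]], int], n: int) -> List[Bits]:
    """One AND term per input assignment on which f outputs 1."""
    return [bits for bits in product((0, 1), repeat=n) if f(bits)]


def eval_dnf(minterms: List[Bits], bits: Sequence[int]) -> int:
    """OR over the AND terms: fires if the input matches any stored assignment."""
    return int(any(tuple(bits) == term for term in minterms))


# Number of AND terms in the depth-2 form of parity, for growing input sizes.
sizes = {n: len(dnf_minterms(parity, n)) for n in range(2, 11)}
```

The depth-2 circuit reproduces parity exactly, but the term count in `sizes` doubles with every added input bit (2^(n-1) terms for n bits). That is the sense in which flattening serial depth is paid for with exponential resources.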
  - joshc 23 Feb 2025 5:10 UTC
    LW: 5 AF: 4
    0
    AF Parent
    Probably the iterated amplification proposal I described is very suboptimal. My goal with it was to illustrate how safety could be preserved across multiple buck-passes if models are not egregiously misaligned.
    
    Like I said at the start of my comment: “I’ll describe [a proposal] in much more concreteness and specificity than I think is necessary because I suspect the concreteness is helpful for finding points of disagreement.”
    
    I don’t actually expect safety will scale efficiently via the iterated amplification approach I described. The iterated amplification approach is just relatively simple to talk about.
    
    What I actually expect to happen is something like:
    - Humans train AI agents that are smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end.
    - Early AI successors create smarter successors in basically the same way.
    - At some point, AI agents start finding much more efficient ways to safely scale capabilities. e.g. maybe initially, they do this with a bunch of weak-to-strong generalization research. And eventually they figure out how to do formally verified distillation.
    
    But at this point, humans will long be obsolete. The position I am defending in this post is that it’s not very important for us humans to think about these scalable approaches.
    - johnswentworth 23 Feb 2025 5:39 UTC
      LW: 15 AF: 8
      5
      AF Parent
      That’s a much more useful answer, actually. So let’s bring it back to Eliezer’s original question:
      Can you tl;dr how you go from “humans cannot tell which alignment arguments are good or bad” to “we justifiably trust the AI to report honest good alignment takes”? Like, not with a very large diagram full of complicated parts such that it’s hard to spot where you’ve messed up. Just whatever simple principle you think lets you bypass GIGO.
      [...]
      Broadly speaking, the standard ML paradigm lets you bootstrap somewhat from “I can verify whether this problem was solved” to “I can train a generator to solve this problem”.
      So to summarize your short, simple answer to Eliezer’s question: you want to “train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end”. And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest good alignment takes sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.
      Or, to compact the summary even further: you want to train the somewhat-smarter-than-human AI on easily-verifiable synthetically-generated tasks, and then hope/expect that its good performance on those tasks generalizes to a problem which is not easily verifiable or synthetically generated, namely the problem of checking that a next generation of AI is in the basin of attraction of a good-to-humans outcome.
      (Note: I know you’ve avoided talking about the basin of attraction of a good-to-humans outcome, instead focused on just some short-term goal like e.g. not being killed by the very next generation of AI. Not focusing on the basin of attraction is a mistake, and we can go into why it’s a mistake if that turns out to be cruxy.)
      In Eliezer’s comment, he was imagining a training setup somewhat different from easily-verifiable synthetically-generated tasks:
      Assume that whenever OpenPhil tries to run an essay contest for saying what they’re getting wrong, their panel of judges ends up awarding the prize to somebody reassuringly saying that AI risk is an even smaller deal than OpenPhil thinks. How does OpenPhil bootstrap from that pattern of thumbs-up/thumbs-down to an AI that actually has better-than-OpenPhil alignment takes?
      … but the analogue of the problem Eliezer was highlighting, in the context of training on easily-verifiable synthetically-generated tasks, is the question: how and why would we justifiably trust that an AI trained on easily-verifiable synthetic tasks generalizes to not-easily-verifiable real-world tasks?
      - joshc 23 Feb 2025 6:21 UTC
        LW: 11 AF: 4
        2
        AF Parent
        > So to summarize your short, simple answer to Eliezer’s question: you want to “train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end”. And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest good alignment takes sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.
        
        This one is roughly right. Here’s a clarified version:
        - If we train an AI that is (1) not faking alignment in an egregious way and (2) looks very competent and safe to us and (3) the AI seems likely to be able to maintain its alignment / safety properties as it would if humans were in the loop (see section 8), then I think we can trust this AI to “put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome” (at least as well as humans would have been able to do so if they were given a lot more time to attempt this task) “despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.”
        
        I think I might have created confusion by introducing the detail that AI agents will probably be trained on a lot of easy-to-verify tasks. I think this is where a lot of their capabilities will come from, but we’ll do a step at the end where we fine-tune AI agents for hard-to-verify tasks (similar to how we do RLHF after pre-training in the chatbot paradigm)
        
        And I think our disagreement mostly pertains to this final fine-tuning step on hard-to-verify tasks.
        
        Here’s where I think we agree. I think we both agree that if humans naively fine-tuned AI agents to regurgitate their current alignment takes, that would be way worse than if humans had way more time to think and do work on alignment.
        
        My claim is that we can train AI agents to imitate the process of how humans improve their takes over time, such that after the AI agents do work for a long time, they will produce similarly good outcomes as the outcomes where humans did work for a long time.
        
        Very concretely: I imagine that if I’m training an AI agent successor, a key thing I do is try to understand how it updates based on new evidence. Does it (appear to) update its views as reasonably as the most reasonable humans would?
        
        If so, and if it is not egregiously misaligned (or liable to become egregiously misaligned) then I basically expect that letting it go and do lots of reasoning is likely to produce outcomes that are approximately as good as humans would produce.
        
        Are we close to a crux?
        
        Maybe a crux is whether training ~aligned AI agents to imitate the process by which humans improve their takes over time will lead to outcomes as good as if we let humans do way more work.