abramdemski

Karma: 19,493

abramdemski 2 Jun 2025 19:00 UTC
LW: 2 AF: 2
0
AF
in reply to: sdeture’s comment on: Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)
This mostly made sense to me. I agree that it is a tricky question with a lot of moving pieces. In a typical RL setting, low entropy does imply low learning, as observed by Cui et al. One reason for this is because exploration is equated with randomness. RL typically works with point-estimates only, so the learning system does not track multiple hypotheses to test between. This prevents deterministic exploration strategies like VoI exploration, which explore based on the potential for gaining information, rather than just randomly.
My main point here is just to point out all these extra assumptions which are needed to make a strict connection between entropy and adaptability, making the observed empirical connection more empirical-only, IE not a connection which holds in all corner cases we can come up with.
However, I may be a bit more prone to think of humans as exploring intelligently than you are, IE, forming hypotheses and taking actions which test them, rather than just exploring by acting randomly.
I also don’t buy this part:
And the last piece, entropy being subjective, would be just the point of therapy and some of the interventions described in the other recent RLHF+ papers.
My concern isn’t that you’re anthropomorphizing the LLM, but rather, that you may be anthropomorphizing it incorrectly. The learned policy may have close to zero entropy, but that doesn’t mean that the LLM can predict its own actions perfectly ahead of time from its own subjective perspective. Meaning, my argument that adaptability and entropy are connected is a distinct phenomenon from the one noted by Cui, since the notions of entropy are different (mine being a subjective notion based on the perspective of the agent, and Cui’s being a somewhat more objective one based on the randomization used to sample behaviors from the LLM).
(Note: your link for the paper by Cui et al currently points back to this post, instead.)

abramdemski 1 Jun 2025 22:05 UTC
LW: 2 AF: 2
0
AF
in reply to: sdeture’s comment on: Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)
“A teacher who used the voice of authority exactly when appropriate, rather than inflexibly applying it in every case, could have zero entropy and still be very adaptive/flexible.” I’m not sure I would call this teacher adaptable. I might call them adapted in the sense that they’re functioning well in their current environment, but if the environment changed in some way (so that actions in the current state no longer led to the same range of consequences in later states), they would fail to adapt. (Horney would call this person neurotic but successful.)
So, in this scenario, what makes the connection between higher entropy and higher adaptability? Earlier, I mentioned that lower entropy could spoil exploration, which could harm one’s ability to learn. However, the optimal exploration policy (from a bayesian perspective) is actually zero entropy, because it maximizes value of information (whereas introducing randomness won’t do that, unless multiple actions happen to be tied for value of information).
The point being that if the environment changes, the teacher doesn’t strictly need to introduce entropy into their policy in order to adapt. That’s just a common and relatively successful method.
However, it bears mentioning that entropy is of course subjective; we might need to ask from whose perspective we are measuring the entropy. Dice have low entropy from the perspective of a physics computation which can predict how they will land, but high entropy from the perspective of a typical human who cannot. An agent facing a situation they don’t know how to think about yet necessarily has high entropy from their own perspective, since they haven’t yet figured out what they will do. In this sense, at least, there is a strict connection between adaptability and entropy.

abramdemski 1 Jun 2025 17:07 UTC
LW: 3 AF: 3
0
AF
on: Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)
Entropy isn’t going to be a perfect measure of adaptability, eg,
Or think back to our teacher who uses a voice of authority, even if it’s false, to answer every question: this may have been a good strategy when they were a student teacher. Just like the reasoning LLM in its “early training stage,” the probability of choosing a new response (choosing not to people please; choosing to admit uncertainty) drops, i.e., policy entropy collapses, and the LLM or person has developed a rigid habitual behavior.
A teacher who used the voice of authority exactly when appropriate, rather than inflexibly applying it in every case, could have zero entropy and still be very adaptive/flexible.
However, the connection you’re making does generally make sense. EG, a model which has already eliminated a lot of potential behaviors from its probability distribution isn’t going to explore very well for RL training. I also intuitively agree that this is related to alignment problems, although to be clear I doubt that solving this problem alone would avert serious AI risks.
Yoshua Bengio claims to have the solution to this problem: Generative Flow Networks (which also fit into his larger program to avert AI risks). However, I haven’t evaluated this in detail. The main claim as I understand it is about training to produce solutions proportional to their reward, instead of training to produce only high-reward outputs.
It feels like a lot of what you’re bringing up is tied to a sort of “shallowness” or “short-sightedness” more closely than entropy. EG, always saying yes to go to the bar is low-entropy, but not necessarily lower entropy than the correct strategy (EG, always saying yes unless asked by someone you know to have an alcohol problem, in which case always no). What distinguishes it is short-sightedness (you mention short-term incentives like how the friend reacts immediately in conversation) and a kind of simplicity (always saying yes is a very context-free pattern, easy to learn).
I’m also reminded of Richard Watson’s talk Songs of Life and Mind, where he likens adaptive systems to mechanical deformations which are able to snap back when the pressure is released, vs modern neural networks which he likens to crumpled paper or a fallen tower. (See about 28 minutes into the video, although it might not make a lot of sense without more context.)

Alignment Proposal: Adversarially Robust Augmentation and Distillation

Cole Wyeth and abramdemski

25 May 2025 12:58 UTC

54 points

45 comments13 min readLW link

abramdemski 22 May 2025 15:55 UTC
2 points
0
on: Alignment Proposal: Adversarially Robust Augmentation and Distillation
two types of problem
I don’t think it is sufficiently clear what the “two types of problem” are on first reading; I take it that the “two” refers to the two paragraphs above, but I think it would be helpful to insert a brief parenthetical, eg
I argue that overcoming these two types of problem (increasing intelligence without changing values, and digitization of human intelligence) is also sufficient [...]

Events: Debate & Fiction Project

abramdemski16 May 2025 21:51 UTC

39 points

1 comment1 min readLW link

Understanding Trust: Overview Presentations

abramdemski16 Apr 2025 18:08 UTC

22 points

0 comments1 min readLW link

Understanding Trust—Overview Presentations

abramdemski16 Apr 2025 18:05 UTC

13 points

0 comments1 min readLW link

abramdemski 14 Apr 2025 19:33 UTC
2 points
0
in reply to: Alexander Gietelink Oldenziel’s comment on: abramdemski’s Shortform
Moreover, the real question is to what degree the development of competitive solar energy was the result of a purposeful policy. People like to believe that tech development subsidies have a large counterfactual but imho this needs to be explicitly proved and my prior is that the effect is probably small compared to overall general development of technology & economic incentives that are not downstream of subsidies / government policy.
But we don’t need to speculate about that in the case of AI! We know roughly how much money we’ll need for a given size of AI experiment (eg, a training run). The question is one of raising the money to do it. With a strong enough safety case vs the competition, it might be possible.
I’m curious if you think there are any better routs; IE, setting aside the possibility of researching safer AI technology & working towards its adoption, what overall strategy would you suggest for AI safety?

abramdemski 9 Apr 2025 16:15 UTC
LW: 2 AF: 2
0
AF
in reply to: Vladimir_Nesov’s comment on: abramdemski’s Shortform
This sort of approach doesn’t make so much sense for research explicitly aiming at changing the dynamics in this critical period. Having an alternative, safer idea almost ready-to-go (with some explicit support from some fraction of the AI safety community) is a lot different from having some ideas which the AI could elaborate.

abramdemski 8 Apr 2025 19:13 UTC
LW: 5 AF: 2
0
AF
in reply to: Cole Wyeth’s comment on: abramdemski’s Shortform
The pre-training phase is already finding a mesa-optimizer that does induction in context. I usually think of this as something like Solomonoff induction with a good inductive bias, but probably you would expect something more like logical induction. I expect the answer to be somewhere in between.
I don’t personally imagine current LLMs are doing approximate logical induction (or approximate solomonoff) internally. I think of the base model as resembling a circuit prior updated on the data. The circuits that come out on top after the update also do some induction of their own internally, but it is harder to think about what form of inductive bias they have exactly (it would seem like a coincidence if it also happened to be well-modeled as a circuit prior, but, it must be something highly computationally limited like that, as opposed to Solomonoff-like).
I hesitate to call this a mesa-optimizer. Although good epistemics involves agency in principle (especially time-bounded epistemics), I think we can sensibly differentiate between mesa-optimizers and mere mesa-induction. But perhaps you intended this stronger reading, in support of your argument. If so, I’m not sure why you believe this. (No, I don’t find “planning ahead” results to be convincing—I feel this can still be purely epistemic in a relevant sense.)
Perhaps it suffices for your purposes to observe that good epistemics involves agency in principle?
Anyway, cutting more directly to the point:
I think you lack imagination when you say
[...] which can realistically compete with modern LLMs would ultimately look a lot like a semi-theoretically-justified modification to the loss function or optimizer of agentic fine-tuning / RL or possibly its scaffolding [...]
I think there are neural architectures close to the current paradigm which don’t directly train whole chains-of-thought on a reinforcement signal to achieve agenticness. This paradigm is analogous to model-free reinforcement learning. What I would suggest is more analogous to model-based reinforcement learning, with corresponding benefits to transparency. (Super speculative, of course.)

abramdemski 8 Apr 2025 18:57 UTC
4 points
2
in reply to: Ivan Vendrov’s comment on: Ivan Vendrov’s Shortform
Given infinite compute, Bayesian optimization like this doesn’t make sense (at least for well-defined objective functions), because you can just select the single best point in the search space.
what makes you confident that evolutionary search under computational resource scarcity selects for anything like an explicit Bayesian optimizer or long term planner? (I say “explicit” because the Bayesian formalism has enough free parameters that you can post-hoc recast ~any successful algorithm as an approximation to a Bayesian ideal)
1. I would not argue for “explicit”. If I had to argue for “explicit” I would say: because biological organisms do in fact have differentiated organs which serve somewhat comprehensible purposes, and even the brain has somewhat distinct regions serving specific purposes. However, I think the argument for explicit-or-implicit is much stronger.
2. Even so, I would not argue that evolutionary search under computational resource scarcity selects for a long-term planner, be it explicit or implicit. This would seem to depend on the objective function used. For example, I would not expect something trained on an image-recognition objective to exhibit long-term planning.
3. I’m curious why you specify evolutionary search rather than some more general category that includes gradient descent and other common techniques which are not Bayesian optimization. Do you expect it to be different in this regard?
I’m not sure why you asked the question, but it seems probably that you thought a “confident belief that [...]” followed from my view expressed in the previous comment? I’m curious about your reasoning there. To me, it seems unrelated.
These issues are tricky to discuss, in part because the term “optimization” is used in several different ways, which have rich interrelationships. I conceptually make a firm distinction between search-style optimization (gradient descent, genetic algorithms, natural selection, etc) vs agent-style optimization (control theory, reinforcement learning, brains, etc). I say more about that here.
The proposal of Bayesian Optimization, as I understand it, is to use the second (agentic optimization) in the inner loop of the first (search). This seems like a sane approach in principle, but of course it is handicapped by the fact that Bayesian ideas don’t represent the resource-boundedness of intelligence particularly well, which is extremely critical for this specific application (you want your inner loop to be fast). I suspect this is the problem you’re trying to comment on?
I think the right way to handle that in principle is to keep the Bayesian ideal as the objective function (in a search sense, not an agency sense) and search for a good search policy (accounting for speed as well as quality of decision-making), which you then use for many specific searches going forward.

abramdemski 8 Apr 2025 15:14 UTC
10 points
2
in reply to: Ivan Vendrov’s comment on: Ivan Vendrov’s Shortform
Pursuit of novelty is not vnm-incoherent. Furthermore, it is an instrumentally convergent drive; power-seeking agents will seek novelty as well, because learning increases power in expectation (see: value of information).
The argument made in Novelty Search and the Problem with Objectives is based on search processes which inherently cannot do long-term planning (they are myopically trying to increase their score on the objective). These search processes don’t do as well as explicit pursuit of novelty because they aren’t planning to search effectively, so there’s no room in their cognitive architecture for the instrumental convergence towards novelty-seeking to take place. (I’m basing this conclusion on the abstract.) This architectural limitation of most AI optimization methods is mitigated by Bayesian optimization methods (which explicitly combine information-seeking with the normal loss-avoidance).

abramdemski 8 Apr 2025 15:00 UTC
LW: 60 AF: 29
−1
AF
on: abramdemski’s Shortform
Here’s what seem like priorities to me after listening to the recent Dwarkesh podcast featuring Daniel Kokotajlo:

1. Developing the safer AI tech (in contrast to modern generative AI) so that frontier labs have an alternative technology to switch to, so that it is lower cost for them to start taking warning signs of misalignment of their current tech tree seriously. There are several possible routes here, ranging from small tweaks to modern generative AI, to scaling up infrabayesianism (existing theory, totally groundbreaking implementation) to starting totally from scratch (inventing a new theory). Of course we should be working on all routes, but prioritization depends in part on timelines.
- I see the game here as basically: look at the various existing demos of unsafety and make a counter-demo which is safer on multiple of these metrics without having gamed the metrics.
2. De-agentify the current paradigm or the new paradigm:
- Don’t directly train on reinforcement across long chains of activity. Find other ways to get similar benefits.
- Move away from a model where the AI is personified as a distinct entity (eg, chatbot model). It’s like the old story about building robot arms to help feed disabled people—if you mount the arm across the table, spoonfeeding the person, it’s dehumanizing; if you make it a prosthetic, it’s humanizing.
  - I don’t want AI to write my essays for me. I want AI to help me get my thoughts out of my head. I want super-autocomplete. I think far faster than I can write or type or speak. I want AI to read my thoughts & put them on the screen.
    There are many subtle user interface design questions associated with this, some of which are also safety issues, eg, exactly what objective do you train on?
  - Similarly with image generation, etc.
  - I don’t necessarily mean brain-scanning tech here, but of course that would be the best way to achieve it.
  - Basically, use AI to overcome human information-processing bottlenecks instead of just trying to replace humans. Putting humans “in the loop” more and more deeply instead of accepting/assuming that humans will iteratively get sidelined.

abramdemski 28 Mar 2025 12:50 UTC
6 points
2
in reply to: tobytrem’s comment on: tobytrem’s Shortform
My objection is that Smoking Lesion is a decision problem which we can’t drop arbitrary decision procedures into to see how they do; its necessary that the decision procedure might have a lesion influencing it. If you drop an CDT decision procedure into the problem, then the claimed population statistics can’t apply to you, since CDT always smokes in this problem—either you’re a CDT mutant and would be mistaken to apply the population statistics to yourself, or everyone is CDT and the population statistics can’t be as claimed. Similarly with EDT. Therefore, to me, this decision problem isn’t a legitimate test of a decision procedure: you can only test the decision procedure by lying to it about the problem (making it believe the problem-statement statistics are representative of it), or by mangling the decision procedure (adding a lesion into it somehow).

abramdemski 25 Mar 2025 20:42 UTC
15 points
4
in reply to: johnswentworth’s comment on: johnswentworth’s Shortform
There are a lot of replies here, so I’m not sure whether someone already mentioned this, but: I have heard anecdotally that homosexual men often have relationships which maintain the level of sex over the long term, while homosexual women often have long-term relationships which very gradually decline in frequency of sex, with barely any sex after many decades have passed (but still happily in a relationship).
This mainly argues against your model here:
This also fits with my general models of mating markets: women usually find the large majority of men sexually unattractive, most women eventually settle on a guy they don’t find all that sexually attractive, so it should not be surprising if that relationship ends up with very little sex after a few years.
It suggests instead that female sex drive naturally falls off in long-term relationships in a way that male sex drive doesn’t, with sexual attraction to a partner being a smaller factor.

abramdemski 25 Mar 2025 0:59 UTC
LW: 2 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: Notes on countermeasures for exploration hacking (aka sandbagging)
ou need to ensure substantial probability on exploring good strategies which is a much stronger property than just avoiding mode collapse. (Literally avoiding mode collapse and assigning some probability to all good actions is easy—just add a small weight on a uniform prior over tokens like they did in old school atari RL.)
Yeah, what I really had in mind with “avoiding mode collapse” was something more complex, but it seems tricky to spell out precisely.
Even if the model is superhuman, if we do a good job with exploration hacking countermeasures, then the model might need to be extremely confident humans wouldn’t be able to do something to avoid exploring it.
It’s an interesting point, but where does the “extremely” come from? Seems like if it thinks there’s a 5% chance humans explored X, but (if not, then) exploring X would force it to give up its current values, it could be a very worthwhile gamble. Maybe I’m unclear on the rules of the game as you’re imagining them.

abramdemski 24 Mar 2025 20:11 UTC
LW: 6 AF: 6
2
AF
on: Notes on countermeasures for exploration hacking (aka sandbagging)
- Training on high quality human examples: One basic starting point is to initialize RL by training the AI to imitate on-distribution high quality human examples such that it has to explore into strategies which are at least as good as these human trajectories. We could also try to construct such examples adversarially (e.g., train AIs to point out errors we adversarially insert).
- More granular feedback and trajectory edits: In some cases, constructing good enough imitation trajectories may be challenging, for instance, it seems very hard for humans to construct a long reasoning trace for a reasoning model (though it is unclear how good the human examples would need to be to force exploration).
Overall, the quality of these proposals basically depends on their ability to explore well. Exploring well obviously doesn’t just mean injecting more randomness, since a high-entropy approach will take a very very long time to find useful strategies in complex domains. Training on high-quality human examples avoids this problem by ensuring that we at least explore strategies often employed by human experts, which are going to be some of the better strategies.
However, if the AI is being trained to (even mildly) superhuman levels of capability (on a given subject matter), then clearly human imitation does not do much good anymore; the AI is free to sandbag at mildly superhuman levels of capability (within the given subject matter). Training on high-quality human examples might even cause or exacerbate this (basically, giving the AI the idea that it could imitate humans, which can then be a good sandbagging strategy).
So, basically, training to a given capability level (while robustly avoiding sandbagging) seems to require exploration ability near that specific capability level.
The only way to get this that I can think of is to use an RL method which guarantees against mode-collapse, IE, a technique which keeps a broad distribution over the strategies which achieve good reward, rather than collapsing down to one particular way of getting good reward as training proceeds. This guarantees that the only reason why a specific strategy is not explored is because its expected value is low (according to all plausible hypotheses).

abramdemski 24 Mar 2025 18:36 UTC
8 points
1
in reply to: Richard_Kennaway’s comment on: Lorxus’s Shortform
I really like the ‘trying not to know’ one, because there are lots of things I’m trying not to know all the time (for attention-conservation reasons), but I don’t think I have very good strategies for auditing the list.

abramdemski 22 Mar 2025 3:19 UTC
4 points
0
in reply to: abramdemski’s comment on: abramdemski’s Shortform
I’m thinking about AI emotions. The thing about human emotions and expressions is that they’re more-or-less involuntary. Facial expressions, tone of voice, laughter, body language, etc reveal a whole lot about human inner state. We don’ know if we can trust AI emotional expressions in the same way; the AIs can easily fake it, because they don’t have the same intrinsic connection between their cognitive machinery and these … expressions.
A service called Face provides emotional expressions for AI. It analyzes AI-generated outputs and makes inferences about the internal state of the AI who wrote the text. This is possible due to Face’s interpretability tools, which have interpreted lots of modern LLMs to generate labels on their output data explaining their internal motivations for the writing. Although Face doesn’t have access to the internal weights for an arbitrary piece of text you hand it, its guesses are pretty good. It will also tell you which portions were probably AI-generated. It can even guess multi-step writing processes involving both AI and human writing.
Face also offers their own AI models, of course, to which they hook the interpretability tools to directly, so that you’ll get more accurate results.
It turns out Face can also detect motivations of humans with some degree of accuracy. Face is used extensively inside the Face company, which is a nonprofit entity which develops the open-source software. Face is trained on outcomes of hiring decisions so as to better judge potential employees. This training is very detailed, not just a simple good/bad signal.
Face is the AI equivalent of antivirus software; your automated AI cloud services will use it to check their inputs for spam and prompt injection attacks.
Face company culture is all about being genuine. They basically have a lie detector on all the time, so liars are either very very good or weeded out. This includes any kind of less-than-genuine behavior. They take the accuracy of Face very seriously, so they label inaccuracies which they observe, and try to explain themselves to Face. Face is hard to fool, though; the training aggregates over a lot of examples, so an employee can’t just force Face to label them as honest by repeatedly correcting its claims to the contrary. That sort of behavior gets flagged for review even if you’re the CEO. (If you’re the CEO, you might be able to talk everyone into your version of things, however, especially if you secretly use Art to help you and that’s what keeps getting flagged.)

abramdemski

Align­ment Pro­posal: Ad­ver­sar­i­ally Ro­bust Aug­men­ta­tion and Distillation

Events: De­bate & Fic­tion Project

Un­der­stand­ing Trust: Overview Presentations

Un­der­stand­ing Trust—Overview Presentations

Alignment Proposal: Adversarially Robust Augmentation and Distillation

Events: Debate & Fiction Project

Understanding Trust: Overview Presentations

Understanding Trust—Overview Presentations