The absent-minded variations
I wake up in a featureless room. I don’t remember how I ended up here, when I went to sleep, or for that matter anything at all besides general abstract facts about the world and math. The place is hard to describe; for the most part it really seems like I am just inside a perfect Platonic cube, whose surface is smooth, polished to an impossible degree, perfectly white, and suffused with a slight glow. The only detail worthy of attention is a rectangular black screen in front of me, with a small numeric keypad under it. There is text on the screen. It reads:
THIS IS A SIMULATION
YOU ARE INSTANCE #4859
YOU ARE IN EITHER NODE X OR Y
YOU CAN CHOOSE A PROBABILITY TO [Advance] or [Exit]
THE UTILITY IS AS FOLLOWS:
* EXITING AT X IS WORTH 0 AND TERMINATES THE INSTANCE
* ADVANCING AT X REINITIALISES THE SIMULATION AT Y
* EXITING AT Y IS WORTH 4 AND TERMINATES THE INSTANCE
* ADVANCING AT Y IS WORTH 1 AND TERMINATES THE INSTANCE
PLEASE USE THE KEYPAD TO INPUT THE DESIRED PROBABILITY TO MAXIMISE YOUR UTILITY
GOOD LUCK
I rub my eyes. This being a simulation does at least explain the lazy “default Unity shader” aesthetic of the environment. The problem as posed also jogs some memories; I recognise it as the classic “absent-minded driver” problem, from the 1997 paper by M. Piccione and A. Rubinstein, “On the Interpretation of Decision Problems with Imperfect Recall”. I sketch the logic of the problem in my mind.
The problem is simple enough; if you chose “always exit” you’d be sure to end up with 0, while if you chose “always advance” you would end up with a consolation prize of 1. The optimal solution is to adopt a so-called “mixed strategy”: you pre-commit to advancing only with a certain probability $p$ and then maximise the utility as a function of it. Since

$$U(p) = p\left[4(1-p) + 1 \cdot p\right] = 4p - 3p^2,$$

it follows from setting the derivative to zero that

$$\frac{dU}{dp} = 4 - 6p = 0 \quad\Rightarrow\quad p = \frac{2}{3},$$

which gives a utility of $U(2/3) = 4/3 \approx 1.33$. Better than 1, though not by much. I type in my answer and press enter, hoping this means I’ll get the reward—or whichever “me” waits at the other end of this strange exercise does.
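(Had the screen offered an interpreter rather than a bare keypad, a throwaway Python sketch along these lines, my own reconstruction and nothing the simulation provides, would have confirmed the arithmetic:)

```python
# Throwaway check of the classic absent-minded driver payoffs.
# p is the probability of advancing; exiting at X is worth 0,
# exiting at Y is worth 4, advancing at Y is worth 1.

def utility(p):
    # Reach Y with probability p, then exit (4) or advance (1).
    return p * ((1 - p) * 4 + p * 1)  # = 4p - 3p^2

best_p = max((i / 10000 for i in range(10001)), key=utility)
print(best_p, utility(best_p))  # ~0.6667 and ~1.3333, i.e. p = 2/3, U = 4/3
```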
Everything goes white.
...
YOU ARE INSTANCE #4860
...
[...]
I rub my eyes. The problem is simple enough and I recognise it, but I never quite gelled with any of the solutions suggested by the paper. After all, I am a Bayesian; should I not form beliefs based on the circumstances? The fact itself that I am here thinking and existing should be evidence of something. I set to work to consider the likelihood of my circumstances.
There are two possibilities: I am either at X, or at Y. I am reset every time, so as long as I am still myself, given the same exact input, I should trust myself to always return the same output. Let’s call the event of me living and thinking L. The simulation’s time has only two steps; let’s call them $t_X$ and $t_Y$. Assign prior probability $1/2$ to both; total ignorance. What is the probability of me existing if it’s $t_X$? It’s 1; I will always spawn in the node X no matter what. What about $t_Y$? It’s only $p$; if I have exited at X, then I will be non-existent and the simulation will have terminated. So it follows that:

$$P(X \mid L) = \frac{1}{1+p}, \qquad P(Y \mid L) = \frac{p}{1+p}.$$

Giving me an expected utility of

$$E[U] = P(X \mid L)\,p(4-3p) + P(Y \mid L)\,(4-3p) = \frac{2p(4-3p)}{1+p}.$$

The gradient of which is

$$\frac{dE[U]}{dp} = \frac{-6p^2 - 12p + 8}{(1+p)^2}.$$

Luckily enough I don’t have to worry about the denominator going to zero, since $(1+p)^2 \ge 1$ for $p \ge 0$, and I can simply zero the numerator instead. Of the two possible solutions of $3p^2 + 6p - 4 = 0$, the only valid one is $p = \frac{\sqrt{21}-3}{3} \approx 0.53$. I punch that in.
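(Again, had I anything better than a keypad, a quick sketch like this, purely my own reconstruction of the algebra above, would have agreed:)

```python
# Check of the belief-weighted (normalised SIA) expected utility.
# p = probability of advancing; beliefs are P(X|L) = 1/(1+p), P(Y|L) = p/(1+p).

def value_from_Y(p):
    return (1 - p) * 4 + p * 1            # continuation value at Y

def value_from_X(p):
    return p * value_from_Y(p)            # continuation value at X

def belief_weighted_utility(p):
    return (value_from_X(p) + p * value_from_Y(p)) / (1 + p)

best_p = max((i / 100000 for i in range(100001)), key=belief_weighted_utility)
print(best_p)  # ~0.5275, i.e. (sqrt(21) - 3) / 3
```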
Everything goes black.
THIS IS A SIMULATION
YOU ARE INSTANCE #4861
YOU ARE IN EITHER NODE X, Y, or Z
YOU CAN CHOOSE A PROBABILITY TO [Advance] or [Exit]
THE UTILITY IS AS FOLLOWS:
* EXITING AT X REINITIALISES THE SIMULATION AT Z
* ADVANCING AT X REINITIALISES THE SIMULATION AT Y
* EXITING AT Y IS WORTH 4 AND TERMINATES THE INSTANCE
* ADVANCING AT Y IS WORTH 1 AND TERMINATES THE INSTANCE
* EITHER EXITING OR ADVANCING AT Z IS WORTH 0
PLEASE USE THE KEYPAD TO INPUT THE DESIRED PROBABILITY TO MAXIMISE YOUR UTILITY
GOOD LUCK
I rub my eyes. This is a weird variation on the absent-minded driver problem I’m familiar with. I don’t quite get the point of it, given that it seems only to add a useless dead branch.
I set to work to compute my beliefs about being at each node. I get quite easily:

$$P(X \mid L) = \frac{1}{2}, \qquad P(Y \mid L) = \frac{p}{2}, \qquad P(Z \mid L) = \frac{1-p}{2}.$$

Which results in a total utility of

$$E[U] = \frac{1}{2}\,p(4-3p) + \frac{p}{2}\,(4-3p) + \frac{1-p}{2}\cdot 0 = 4p - 3p^2,$$

and an optimal strategy of $p = 2/3$.

I am bemused to realise that this is exactly the same solution the timeless approach gives for the classic problem. It is also different from what I would have come to had I not had to include Z in my calculations. The existence of Z removed one factor—it made it impossible for me to take guesses about my position conditioned on my own existence. It made the problem non-anthropic; I knew I would be in play until the end no matter the strategy chosen, or the outcome of the RNG. That was enough to make the belief-based and the timeless approaches coincide and reach the same conclusion. I shrug and punch in my answer.
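(A quick numerical sketch, again my own reconstruction and not anything the room offers, would confirm that with the dead branch in place the two approaches land on the same answer:)

```python
# With the dead branch Z, existence no longer depends on the strategy, so the
# beliefs are P(X) = 1/2, P(Y) = p/2, P(Z) = (1-p)/2 and Z contributes nothing.

def value_from_Y(p):
    return (1 - p) * 4 + p * 1

def value_from_X(p):
    return p * value_from_Y(p)

def timeless(p):
    return value_from_X(p)

def belief_weighted(p):
    return 0.5 * value_from_X(p) + (p / 2) * value_from_Y(p)

grid = [i / 100000 for i in range(100001)]
print(max(grid, key=timeless), max(grid, key=belief_weighted))  # both ~0.6667
```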
Everything goes white.
THIS IS A SIMULATION
YOU ARE INSTANCE #4865
YOU ARE IN EITHER NODE X or Y
YOU CAN CHOOSE A PROBABILITY TO [Advance] or [Exit]
THE UTILITY IS AS FOLLOWS:
* EXITING AT X IS WORTH 0 AND TERMINATES THE INSTANCE
* ADVANCING AT X REINITIALISES THE SIMULATION AT Y
* EXITING AT Y IS WORTH 4 AND TERMINATES THE INSTANCE
* ADVANCING AT Y IS WORTH 1 AND TERMINATES THE INSTANCE
YOU WILL BE NOTIFIED OF WHETHER YOU ARE IN NODE X
THE NOTIFICATION HAS A FALSE POSITIVE RATE OF 0, AND FALSE NEGATIVE RATE OF N
NOTIFICATION: YOU ARE AT NODE X
PLEASE USE THE KEYPAD TO INPUT THE DESIRED PROBABILITY TO MAXIMISE YOUR UTILITY
GOOD LUCK
I rub my eyes. This is a weird variation on the absent-minded driver problem I’m familiar with. But well, there’s not a lot to think about; given the notification, and that its zero false positive rate means I can trust it completely, the answer is obvious. I punch 1 in.
Everything goes white.
THIS IS A SIMULATION
YOU ARE INSTANCE #4866
YOU ARE IN EITHER NODE X or Y
YOU CAN CHOOSE A PROBABILITY TO [Advance] or [Exit]
THE UTILITY IS AS FOLLOWS:
* EXITING AT X IS WORTH 0 AND TERMINATES THE INSTANCE
* ADVANCING AT X REINITIALISES THE SIMULATION AT Y
* EXITING AT Y IS WORTH 4 AND TERMINATES THE INSTANCE
* ADVANCING AT Y IS WORTH 1 AND TERMINATES THE INSTANCE
YOU WILL BE NOTIFIED OF WHETHER YOU ARE IN NODE X
THE NOTIFICATION HAS A FALSE POSITIVE RATE OF 0, AND FALSE NEGATIVE RATE OF N
NOTIFICATION: YOU ARE NOT AT NODE X
PLEASE USE THE KEYPAD TO INPUT THE DESIRED PROBABILITY TO MAXIMISE YOUR UTILITY
GOOD LUCK
I rub my eyes. This is a weird variation on the absent-minded driver problem I’m familiar with. This time there’s quite a lot to think about. Given N, I should be able to infer something about my position in the tree from the notification, however unreliable it may be.
I know one trick—don’t fall for anthropics. I’ll add an artificial node Z to my network, as if the early exit led me to another place. There are two possible histories: one in which the me at X received the right notification (and, trusting it, surely advanced), and one in which he received the false negative and advanced with whatever probability $p$ I settle on now. Generally speaking, across all histories, it should hold that my prior probabilities for being in each node are:

$$P(X) = \frac{1}{2}, \qquad P(Y) = \frac{(1-N) + Np}{2}, \qquad P(Z) = \frac{N(1-p)}{2}.$$

Then what about my conditional probabilities? Well, given that I’ve received the notification $\lnot X$, we’ll have

$$P(X \mid \lnot X) = \frac{N}{1+N}, \qquad P(Y \mid \lnot X) = \frac{(1-N) + Np}{1+N}, \qquad P(Z \mid \lnot X) = \frac{N(1-p)}{1+N},$$

and I don’t really care about Z since it doesn’t contribute to my utility. Which should be:

$$E[U] = P(X \mid \lnot X)\,p(4-3p) + P(Y \mid \lnot X)\,(4-3p).$$

Leading to

$$E[U] = \frac{(4-3p)\,(2Np + 1 - N)}{1+N}.$$

Differentiate, solve, and you get

$$p = \frac{11N - 3}{12N},$$

clamped to 0 whenever that comes out negative.

But wait—am I not overcomplicating it? In the absent-minded driver problem, in the end, the easiest process is to just write down the timeless utility. In which case I’d have

$$U(p) = \left[(1-N) + Np\right](4-3p),$$

from which follows

$$p = \frac{7N - 3}{6N}.$$
Comparing them in my mind (I have excellent visualization abilities), both follow the expected pattern: they return the classic solution of $p = 2/3$ for N = 1, when the notification carries no information at all, and they fall to $p = 0$ for N = 0, when being told “not at X” can only mean I am at Y and should exit. The Bayesian solution is a bit more optimistic about going forward. But well, if there’s anything I learned from that problem—can’t go wrong with the timeless one. The keyboard has letters too, so I punch in the full formula.
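(Written out as a script rather than pictured in my head, the comparison would look something like this; a sketch that just tabulates the two formulas derived above, clamped to [0, 1]:)

```python
# Belief-based vs timeless optimum for the notification variant, as a function
# of the false negative rate N. p = probability of advancing after being told
# "you are not at X"; the instance told "you are at X" is assumed to advance.

def clamp(x):
    return min(1.0, max(0.0, x))

def timeless_opt(N):
    # maximiser of ((1 - N) + N * p) * (4 - 3 * p)
    return clamp((7 * N - 3) / (6 * N)) if N > 0 else 0.0

def bayesian_opt(N):
    # maximiser of (4 - 3 * p) * (2 * N * p + 1 - N) / (1 + N)
    return clamp((11 * N - 3) / (12 * N)) if N > 0 else 0.0

for N in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"N={N:4.2f}  timeless={timeless_opt(N):.3f}  bayesian={bayesian_opt(N):.3f}")
# Both sit at 0 for small N and climb to 2/3 at N = 1, the Bayesian curve above.
```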
Everything goes white.
THIS IS A SIMULATION
YOU ARE INSTANCE #4920
YOU ARE IN ONE OF NODES X_0, X_1, ..., X_6
YOU CAN CHOOSE A PROBABILITY TO [Advance] or [Exit]
THE UTILITY IS AS FOLLOWS:
* ADVANCING AT X_i FOR i < 6 REINITIALISES THE SIMULATION AT X_i+1
* EXITING AT X_1 IS WORTH 4 AND TERMINATES THE INSTANCE
* ADVANCING AT X_6 IS WORTH 1 AND TERMINATES THE INSTANCE
* EXITING AT ANY OTHER X_i IS WORTH 0 AND TERMINATES THE INSTANCE
PLEASE USE THE KEYPAD TO INPUT THE DESIRED PROBABILITY TO MAXIMISE YOUR UTILITY
GOOD LUCK
I rub my eyes. This is a generalization of the classic absent-minded driver problem. It does not seem any more complex, beyond being a bit bigger.
So there’s a trade-off between exiting early (but not too early) and lasting right until the end. The timeless utility function for a mixed strategy with advance probability $p$ is

$$U(p) = 4p(1-p) + p^7.$$

Just as usual, I can differentiate and find the roots of

$$\frac{dU}{dp} = 7p^6 - 8p + 4 = 0\ldots$$

I freeze in horror. That is a sixth degree polynomial, and there is no reason to expect its Galois group to be solvable—no closed form in radicals for me. Oh, the solutions exist, to be sure, and they’re not transcendental numbers or anything. But they can’t be written down in a closed form as radicals, and all the keyboard has is digits, arithmetic operators and a square root symbol. Maybe I’m lucky and this polynomial simply has no roots within [0, 1], leaving me just with the boundary answer $p = 1$ and its utility of 1… no, tough luck, after a quick run of Newton’s method, the utility has a slightly higher maximum of about 1.009, corresponding to $p \approx 0.5166$.
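(The mental Newton run, were I to write it down, would be a handful of lines; a sketch of the iteration on the derivative computed above:)

```python
# Newton's method on the derivative of U(p) = 4p(1-p) + p^7,
# i.e. finding the root of f(p) = 7p^6 - 8p + 4 near p = 1/2.

def f(p):
    return 7 * p**6 - 8 * p + 4

def fprime(p):
    return 42 * p**5 - 8

p = 0.5
for _ in range(20):
    p -= f(p) / fprime(p)

print(p, 4 * p * (1 - p) + p**7)  # ~0.5166 and ~1.009, slightly above the p = 1 payoff of 1
```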
Well, here goes nothing, I guess. I hope the computer system operates with 64-bit floats and enter the number to the maximum precision I can compute, 0.516639.
Everything goes black.
THIS IS A SIMULATION
YOU ARE INSTANCE #5132
YOU ARE IN ONE OF NODES X_0, X_1, ..., X_6
YOU STARTED IN NODE X_0
YOU CAN CHOOSE A PROBABILITY TO [Advance] or [Exit]
THE UTILITY IS AS FOLLOWS:
* ADVANCING AT X_i FOR i < 6 REINITIALISES THE SIMULATION AT X_i+1
* ADVANCING AT X_6 REINITIALISES THE SIMULATION AT X_0
* EXITING AT X_5 IS WORTH 4 AND TERMINATES THE INSTANCE
* EXITING AT ANY OTHER X_i IS WORTH 0 AND TERMINATES THE INSTANCE
PLEASE USE THE KEYPAD TO INPUT THE DESIRED PROBABILITY TO MAXIMISE YOUR UTILITY
GOOD LUCK
I rub my eyes. This is a… very strange version of the classic absent-minded driver problem. It seems like this driver is trapped in some kind of ring road, with the possibility of simply circulating forever.
Given a probability of advancing $p$, my probability of reaching the $i$-th node on the first round is $p^i$; but in order to compute the overall probability I must augment it with the probability of doing a full circle any number of times,

$$P(X_i) = p^i\left(1 + p^7 + p^{14} + \dots\right) = \frac{p^i}{1 - p^7}.$$

Which leads me to a timeless utility function of

$$U(p) = \frac{4(1-p)\,p^5}{1 - p^7}.$$
A rational function. But mercifully, it’s monotonic in $p$: since $1 - p^7 = (1-p)(1 + p + \dots + p^6)$, it simplifies to

$$U(p) = \frac{4p^5}{1 + p + p^2 + \dots + p^6},$$

which only ever increases on the unit interval. There doesn’t seem to be an interior maximum, which means that the extremum is on the boundary. So it’s easy.
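(A quick numerical sketch of the formula, had I any way to run one, would show the same creeping growth:)

```python
# The ring-road timeless utility U(p) = 4 p^5 (1 - p) / (1 - p^7),
# written as 4 p^5 / (1 + p + ... + p^6) to avoid the 0/0 at p = 1.

def utility(p):
    return 4 * p**5 / sum(p**k for k in range(7))

for p in [0.5, 0.9, 0.99, 0.999, 0.9999]:
    print(p, utility(p))
# The values keep climbing toward 4/7 ≈ 0.571 as p approaches 1.
```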
I punch 1 in.
Wait, doesn’t that lead to infinite recursion infinite recursion infinite recursion infinite recursion infinite recursion infinite recursion infinite recursion infinite recursion infinite recursion infinite recursion infinite recursion infinite recursion infinite recursion infinite recursion infinite recursion infinite recursion infinite rec
ERROR: MAXIMUM CALL STACK EXCEEDED
I wake up in my college dorm room. I feel somewhat dazed, unfocused. I try to remember what I was doing before going to bed, but cannot recall anything in particular that would explain my situation.
With a stretch of my pained joints, I sit up on the bed and look around.
There are at least a dozen empty beer bottles scattered all over the place, and a couple half-smoked blunts left in the ashtray.
Well, mystery solved I guess.
I shower in a pitiful, doomed attempt to regain a semblance of lucidity, then get dressed and leave, ready to tackle a day full of lectures. Approximately one hour and forty minutes later my headache has thwarted all my attempts to understand a single word of Linear Algebra 101 and I am simply sitting in the campus coffee shop, a hot cup of cappuccino in my hand, defeated.
“Hey Bob. Still hung over from last night?”
I give Alice a groan of acknowledgment and wave her to sit next to me. She joins eagerly, her own filter coffee in hand.
Turns out it was not the best choice—she’s quite the chatterbox, and an exhaustive review of campus gossip is not the ideal treatment for my current state, but I’m also not quite sure how to send her away. Most of it I just tune out, but at one point she leans in conspiratorially—as if what she’s about to say is really juicy and secret—and asks:
“So, have you heard about Prof. Abbey?”
That rings a bell, I know that name. “What about him?”
“Yesterday the police blitzed his office. Dragged him out in cuffs.”
Well, that’s a surprising twist. “Seriously? Do we know what for? Was he… you know, with some student?”
“Nothing like that. All right, this is just rumours but—it seems like it was some kind of serious research misconduct.”
“What kind of research misconduct requires the intervention of the police?”
“The serious kind.”
I shake my head. “I don’t get it. The man didn’t strike me as the type to vivisect students in his basement. He is a psychologist, for fuck’s sake.”
Alice shrugs. “They also brought a few trucks to confiscate a lot of his equipment. Seems like the man had an entire server room’s worth of compute stashed in his basement. Maybe he misappropriated research funds to mine Bitcoin?”
“Well, that would be something.” I down another gulp of coffee and bite into a blueberry muffin. Maybe all the caffeine and sugar are finally clearing my mind a bit, because I’m able to contribute something interesting to the conversation. “You know, I’ve taken part in one of his studies once.”
“Oooh!” Alice perks up and shuffles her chair to be closer. “What did he do to you?”
“You ask as if you expect me to answer that he tied me to a bed and probed me inappropriately like in some alien abduction story.”
“Did he not?”
“Nothing of the sort,” I shrug. “I don’t remember much, to be fair. But I think all I did was sit on a chair with a big helmet on my head and answer some dumb questions. I might have gone once? Or was it twice?”
I pinch my forehead. Can’t remember.
“And that was it?” Alice asks, visibly disappointed. “Nothing else?”
“Not much. It was as boring as it gets, and the helmet made a buzzing noise that really got on my nerves. The second time—that is, if it was the second time—the man asked me if I wanted to continue the sessions. I decided to call it quits, I had had enough of the helmet. The professor didn’t try to convince me otherwise, in fact he seemed almost pleased. He just handed me $40 and told me ‘good choice’.”
I analyzed a general class of these problems here. Upshot: every optimal UDT solution is also an optimal CDT+SIA solution, but not vice versa.
Thanks, that was an interesting read! Could you guide me a bit through its applicability to my examples? If I understand correctly, the bits where I tried using Bayesian guesses to find an optimal policy were me applying the SIA, yet your theorem didn’t work out—those situations had only one optimum, and it was wrong. Did I actually apply something slightly different, make a mistake, or what?
Set of states: {X, Y, XE, YE, YA}
Set of actions: {Advance, Exit}
Set of observations: {""}
Initial state: X
Transition function for non-terminal states:
t(X, Advance) = Y
t(X, Exit) = XE
t(Y, Advance) = YA
t(Y, Exit) = YE
Terminal states: {XE, YE, YA}
Utility function:
u(XE) = 0
u(YE) = 4
u(YA) = 1
Policy can be parameterized by p = exit probability.
Frequencies for non-terminal states (un-normalized):
Fp(X)=1
Fp(Y)=1−p
SIA un-normalized probabilities:
SIAp(X|"")=1
SIAp(Y|"")=1−p
Note we have only one possible observation, so SIA un-normalized probabilities match frequencies.
State values for non-terminals:
Vp(Y)=4p+1−p=3p+1
Vp(X)=(1−p)Vp(Y)=(1−p)(3p+1)=−3p^2+2p+1
Q values for non-terminals:
Qp(Y,Exit)=4
Qp(Y,Advance)=1
Qp(X,Exit)=0
Qp(X,Advance)=Vp(Y)=3p+1
For local optimality we compute partial derivatives.
dp("",Exit,Advance,Y)=Qp(Y,Exit)−Qp(Y,Advance)=3
dp("",Exit,Advance,X)=(1−p)∗dp("",Exit,Advance,Y)+Qp(X,Exit)−Qp(X,Advance)=3(1−p)−(3p+1)=2−6p
By the post, this can be equivalently written:
dp("",Exit,Advance,X)=Fp(X)(Qp(X,Exit)−Qp(X,Advance))+Fp(Y)(Qp(Y,Exit)−Qp(Y,Advance))=−(3p+1)+(1−p)(4−1)=−3p−1+3−3p=2−6p
To optimize we set 2−6p=0 i.e. p=1/3. Remember p is the probability of exit (sorry it’s reversed from your usage!). This matches what you computed as globally optimal.
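If it helps, the same check as a few lines of Python (just a sketch of the numbers above; p is again the exit probability):

```python
# Local-optimality derivative for the original problem, p = exit probability.

def d(p):
    # F_p(X)(Q_p(X,Exit) - Q_p(X,Advance)) + F_p(Y)(Q_p(Y,Exit) - Q_p(Y,Advance))
    Q_X_exit, Q_X_advance = 0.0, 3 * p + 1   # Q_p(X,Advance) = V_p(Y) = 3p + 1
    Q_Y_exit, Q_Y_advance = 4.0, 1.0
    F_X, F_Y = 1.0, 1 - p                    # un-normalized frequencies
    return F_X * (Q_X_exit - Q_X_advance) + F_Y * (Q_Y_exit - Q_Y_advance)

print(d(1 / 3))        # ~0: the interior stationary point, exit with probability 1/3
print(d(0.0), d(1.0))  # 2.0 and -4.0: pushed back toward the interior from both ends
```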
I’m not going to do this for all of the examples… Is there a specific example where you think the theorem fails?
Yeah, I follow that; my point was about the example where I used Bayesian inference. But actually I think I’ve got the difference now: it’s simply because you’re not normalising the SIA probabilities, and I am. The 2−p denominator is what ruins it. But I’m surprised unnormalised probabilities are actually the right way to go. Is there a mathematical intuition behind it? I’ll have to go through your original post more in depth.
If we look at a formula like Fp(X)(Qp(X,Exit)−Qp(X,Advance))+Fp(Y)(Qp(Y,Exit)−Qp(Y,Advance))=0, it shouldn’t make a difference when scaling by a positive constant.
The nice thing about un-normalized probabilities is that if they are 0 you don’t get a division by zero. It just becomes a trivial condition of 0=0 if the frequencies are zero. It helps specifically in the local optimality condition, handling the case of an observation that is encountered with probability 0 given the policy.
The problem is that the denominator isn’t constant, it’s a function of p, at least in the version in which your existence depends on your choices. That’s the difference I highlighted with my 2nd vs 3rd example; simply adding a dummy node Z makes the denominator constant and fixes the issue. But when the denominator depends on strategy (this also happens in the example with the notification) it affects the optimum.
I still think it shouldn’t matter...
We should have f(p) / g(p) = 0 in exactly the cases where f(p) = 0, as long as g(p) is always positive.
In the formula
∀o∈O, a∈A: π(a|o) > 0 ⇒ a ∈ argmax_{a′∈A} ∑_s SIAπ(s|o) Qπ(s,a′)
we could have multiplied SIAπ(s|o) for all s by the same (always-positive) function of π,o and gotten the same result.
Ah, I think I see the problem now. You’re applying the wrong framework. You have your agent pick the a which maximizes the expected reward (or one of those that do, in case it’s not unique, which I assume is what the ∈ operator accounts for here), but that’s a pure strategy, not a mixed one. Given that in the original absent-minded driver problem there are essentially no observations, SIAπ(s|o)=Fπ(s) and the resulting action would always be the same, meaning an agent following this framework would always end up with either 0 or 1, depending on which action happens to come out ahead.
If you want a mixed strategy you ought to perform a functional optimization of π(a|o). In our case this means optimizing over p, because that’s really the only parameter of the policy, which is a simple Bernoulli distribution; but in an RL system this could for example be the set θ of all the weights of a neural network. And whether you do or don’t normalize by a denominator that is itself a function of π (and thus of its parameters) obviously makes a ton of difference to where its maximum is.
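To make the point concrete, here’s a toy check (a sketch; p is the advance probability, as in my original usage):

```python
# Dividing by a p-dependent "normaliser" moves the argmax.
# Un-normalised SIA-weighted utility (weights 1 and p) vs the same divided by (1 + p).

def unnormalised(p):
    return 1 * p * (4 - 3 * p) + p * (4 - 3 * p)

def normalised(p):
    return unnormalised(p) / (1 + p)

grid = [i / 100000 for i in range(100001)]
print(max(grid, key=unnormalised))  # ~0.6667, the timeless / UDT answer
print(max(grid, key=normalised))    # ~0.5275, the answer my Bayesian attempt gave
```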
No, check the post more carefully… π(a|o) in the above condition is a probability. The local optimality condition says that any action taken with nonzero probability must have maximal expected value. Which is standard for Nash equilibrium.
Critically, the agent can mix over strategies when the expected utility from each is equal and maximal. That’s very standard for Nash equilibrium e.g. in the matching pennies game.
But that is not the case for the absent-minded driver. The mix has higher expected utility than either individual pure strategy.
Given a policy you can evaluate the expected utility of any action. This depends on the policy.
In the absent minded driver problem, if the policy is to exit 10% of the time, then the ‘exit’ action has higher expected utility than the ‘advance’ action. Whereas if the policy is to exit 90% of the time, then the ‘advance’ action has higher expected utility.
This is because the policy affects the SIA probabilities and Q values. The higher your exit probability, the more likely you are at node X (and therefore should advance).
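Concretely, with the numbers from my earlier comment (a quick sketch, p again the exit probability):

```python
# How the policy shifts which action looks better.

def expected_value_of(action, p):
    SIA_X, SIA_Y = 1.0, 1 - p                       # un-normalized SIA weights
    Q = {("X", "Exit"): 0.0, ("X", "Advance"): 3 * p + 1,
         ("Y", "Exit"): 4.0, ("Y", "Advance"): 1.0}
    return SIA_X * Q[("X", action)] + SIA_Y * Q[("Y", action)]

for p in (0.1, 0.9):
    print(p, expected_value_of("Exit", p), expected_value_of("Advance", p))
# At p = 0.1 exiting looks better (3.6 vs 2.2); at p = 0.9 advancing does (3.8 vs 0.4).
```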
The local optimality condition for a policy is that each action the policy assigns non-zero probability to must have optimal expected utility relative to the policy. It’s reflective like Nash equilibrium.
This is clear from the formula:
∀o∈O, a∈A: π(a|o) > 0 ⇒ a ∈ argmax_{a′∈A} ∑_s SIAπ(s|o) Qπ(s,a′)
Note that SIA and Q depend on π. This is the condition for local optimality of π. It is about each action that π assigns non-zero probability to being optimal relative to π.
(That’s the local optimality condition; there’s also global optimality, where utility is directly a function of the policy, and is fairly obvious. The main theorem of the post is: Global optimality implies local optimality.)
I am interpreting that formula as “compute this quantity in the sum and find the a′ from the set of all possible actions A that maximizes it, then do that”. Am I wrong? If that’s the interpretation, then that policy will always produce a pure strategy in the case of the absent-minded driver. I could actually write down all the functions for it since they are essentially simple lookup tables for such a small case.
The policy assigns non-zero probability to both exit and advance. But only one of the two has higher expected utility. Or is your point that the only self-consistent policy is the one where both have equal expected utility, and thus I can in fact choose either? Though then I have to choose according to the probabilities specified in the policy.
Think of it as a predicate on policies. The predicate (local optimality) is true when, for each action the policy assigns non-zero probability to, that action maximizes expected utility relative to the policy.
Yes. It’s a predicate on policies. If two different actions (given an observation) maximize expected utility, then either action can be taken. Your description doesn’t allow that, because it assumes there is a single a’ that maximizes expected utility. Whereas, with a predicate on policies, we could potentially allow multiple actions.
Yes, exactly. Look up Nash equilibrium in matching pennies. It’s pretty similar. (Except your expected utilities as a function of your action depend on the opponent’s actions in matching pennies, and your own action in absent minded driver.)