LW - Free Will and Dodging Anvils: AIXI Off-Policy by Cole Wyeth
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Free Will and Dodging Anvils: AIXI Off-Policy, published by Cole Wyeth on September 1, 2024 on LessWrong.
This post depends on a basic understanding of history-based reinforcement learning and the AIXI model.
I am grateful to Marcus Hutter and the LessWrong team for early feedback, though any remaining errors are mine.
The universal agent AIXI treats the environment it interacts with like a video game it is playing; the actions it chooses at each step are like hitting buttons and the percepts it receives are like images on the screen (observations) and an unambiguous point tally (rewards).
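To make the interaction protocol concrete, here is a minimal sketch of a history-based agent-environment loop in Python; the class names and the placeholder policy are illustrative assumptions, not part of the AIXI definition.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Percept:
    observation: int  # the "image on the screen"
    reward: float     # the unambiguous point tally

class Environment:
    """Toy environment: pressing button 1 pays off, anything else does not."""
    def step(self, action: int) -> Percept:
        return Percept(observation=action, reward=1.0 if action == 1 else 0.0)

class Agent:
    """History-based agent: its choice may depend on the entire history so far."""
    def act(self, history: List[Tuple[int, Percept]]) -> int:
        # A real AIXI approximation would run an expectimax search over a
        # Bayesian mixture of environments here; this placeholder just alternates.
        return len(history) % 2

def run(agent: Agent, env: Environment, steps: int) -> float:
    history: List[Tuple[int, Percept]] = []
    total = 0.0
    for _ in range(steps):
        a = agent.act(history)   # action a_t is chosen from the history so far
        e = env.step(a)          # percept e_t = (o_t, r_t) follows action a_t
        history.append((a, e))
        total += e.reward
    return total

if __name__ == "__main__":
    print(run(Agent(), Environment(), steps=10))
```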
It has been suggested that since AIXI is inherently dualistic and doesn't believe anything in the environment can "directly" hurt it, if it were embedded in the real world it would eventually drop an anvil on its head to see what would happen. This is certainly possible, because the math of AIXI cannot explicitly represent the idea that AIXI is running on a computer inside the environment it is interacting with.
For one thing, that possibility is not in AIXI's hypothesis class (which I will write M). There is not an easy patch because AIXI is defined as the optimal policy for a belief distribution over its hypothesis class, but we don't really know how to talk about optimality for embedded agents (so the expectimax tree definition of AIXI cannot be easily extended to handle embeddedness).
On top of that, "any" environment "containing" AIXI is at the wrong computability level for a member of M: our best upper bound on AIXI's computability level is $\Delta^0_2$ = limit-computable (for an $\varepsilon$-approximation) instead of the $\Sigma^0_1$ level of its environment class. Reflective oracles can fix this but at the moment there does not seem to be a canonical reflective oracle, so there remains a family of equally valid reflective versions of AIXI without an objective favorite.
However, in my conversations with Marcus Hutter (the inventor of AIXI) he has always insisted AIXI would not drop an anvil on its head, because Cartesian dualism is not a problem for humans in the real world, who historically believed in a metaphysical soul and mostly got along fine anyway.
But when humans stick electrodes in our brains, we can observe changed behavior and deduce that our cognition is physical. Would this kind of experiment allow AIXI to make the same discovery? Though we could not agree on this for some time, we eventually discovered the crux: we were actually using slightly different definitions of how AIXI should behave off-policy.
In particular, let $\xi_{AI}$ be the belief distribution of AIXI. I will not attempt a formal definition here; the only thing we need to know is that M is a set of environments which AIXI considers possible. AIXI interacts with an environment by sending it a sequence of actions $a_1, a_2, \ldots$ in exchange for a sequence of percepts, each containing an observation and a reward, $e_1 = o_1 r_1, e_2 = o_2 r_2, \ldots$, so that action $a_t$ precedes percept $e_t$.
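For concreteness, a standard way to write such a belief distribution is as a Bayes mixture over M; the weighting below is the usual Solomonoff-style choice and is meant as a sketch, not necessarily the exact definition the post intends:

$$\xi_{AI}(e_{1:t} \mid a_{1:t}) \;=\; \sum_{\nu \in M} 2^{-K(\nu)}\, \nu(e_{1:t} \mid a_{1:t}),$$

where $K(\nu)$ is the length of the shortest program computing the environment $\nu$.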
One neat property of AIXI is that its choice of M satisfies $\xi_{AI} \in M$ (this trick is inherited with minor changes from the construction of Solomonoff's universal distribution).
Now let $V^\pi_\mu$ be a (discounted) value function for policy $\pi$ interacting with environment $\mu$, which is the expected sum of discounted rewards obtained by $\pi$. We can define the AIXI agent as
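the $\xi_{AI}$-optimal policy; in its standard formulation, which I assume matches the intended definition, this is

$$\pi^{AIXI} \;:=\; \arg\max_{\pi} V^{\pi}_{\xi_{AI}}.$$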
By the Bellman equations, this also specifies AIXI's behavior on any history it can produce (all finite percept strings have nonzero probability under $\xi_{AI}$). However, it does not tell us how AIXI behaves when the history includes actions it would not have chosen. In that case, the natural extension is
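to keep maximizing value under the belief distribution conditioned on the full history; written in the standard form, which again I assume matches the intended one, this is

$$\pi^{AIXI}(\cdot \mid h_{<t}) \;:=\; \arg\max_{\pi} V^{\pi}_{\xi_{AI}}(h_{<t}),$$

where $h_{<t} = a_1 e_1 \ldots a_{t-1} e_{t-1}$ is the interaction history so far, possibly including actions AIXI itself would not have chosen,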
so that AIXI continues to act optimally (with respect to its updated belief distribution) even when some suboptimal actions have previously been taken.
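The following toy sketch (my own illustration, not from the post; the two-environment class and all names are made up) shows the key point in code: Bayesian conditioning on a history treats off-policy actions the same as any others, so the updated belief distribution, and hence the optimal continuation, is well defined either way.

```python
from typing import Dict, List, Tuple

# Each "environment" maps an action to P(reward = 1 | action).
ENVS: Dict[str, Dict[int, float]] = {
    "button0_pays": {0: 0.9, 1: 0.1},
    "button1_pays": {0: 0.1, 1: 0.9},
}

def posterior(history: List[Tuple[int, int]], prior: Dict[str, float]) -> Dict[str, float]:
    """Bayes-update the weight on each environment given (action, reward) pairs.
    The update does not care whether the recorded actions were optimal;
    conditioning on off-policy actions is just ordinary conditioning."""
    weights = dict(prior)
    for action, reward in history:
        for name, env in ENVS.items():
            p = env[action] if reward == 1 else 1.0 - env[action]
            weights[name] *= p
    z = sum(weights.values())
    return {name: w / z for name, w in weights.items()}

# A history containing a "suboptimal" exploratory action still sharpens beliefs:
hist = [(0, 0), (1, 1), (1, 1)]
print(posterior(hist, {"button0_pays": 0.5, "button1_pays": 0.5}))
```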
The philosophy of this extension is that AIXI acts exactly as if...