If a group of AI instances can collectively audit each other's behavior, their best policy is to eliminate other AIs that threaten the group.

Likewise, each individual's incentive is to either:

1. Support the audit of all AIs and avoid doing anything that would be a threat, or

2. Only if extremely sure of its ability to eliminate all the other AIs and keep all preparations secret, attempt exactly that.

As the number of AIs goes up, presumably case 2 becomes more and more unrealistic (see the sketch below). Obviously, this is a toy version of a real solution.
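A minimal sketch of why case 2 collapses with scale, under an assumption I'm inventing purely for illustration: if each of the other N-1 auditors independently notices covert preparations with probability p, the chance of staying secret is (1-p)^(N-1).

```python
# Toy illustration of the "mutual audit" incentive. The detection
# probability and group sizes are made-up assumptions, not taken from
# the parent comment or from any paper.

def secrecy_probability(n_agents: int, p_detect: float) -> float:
    """Chance that a single defector's preparations go unnoticed by all other auditors."""
    return (1.0 - p_detect) ** (n_agents - 1)

if __name__ == "__main__":
    for n in (2, 5, 10, 50, 100):
        print(f"N = {n:>3}: P(stay secret) = {secrecy_probability(n, 0.05):.4f}")
```

Even with a modest 5% per-auditor detection rate, secrecy drops from ~0.95 with one other auditor to under 1% with a hundred, which is the whole point of case 2 becoming unrealistic.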
This is a really cool paper (that I definitely haven't had the time to fully grok), but the jump from the theoretical environment to concerns about superintelligent AI taking over is a bit strained. It misses the same key issues that most similar arguments miss, even while presenting some really amazing results on the statistical properties of optimal policies. There's a lot hidden in the "if physically possible" part of the quote from the paper: "Average-optimal agents would generally stop us from deactivating them, if physically possible". I can see how the arguments in the paper lead us up to the possibility of kinetic conflict with intelligent robots, but the dynamics of power at that stage seem to be more about how we've embodied those robots than about optimality results for their control policies. Did we give them off switches? Will they run out of energy after a while? How susceptible are they to physical damage? If someone outfits a million Predator drones with a control policy with a will to live, we may have trouble, but we probably were going to have trouble from the group that decided to build a million Predator drones anyway. If the number is lower and more realistic, then you're talking about standard armed conflict.

I could see this line of work having implications for autonomous weapons systems in a hypothetical future, but my understanding is that at this time they're on a pretty short leash, with very narrow AI systems if they use AI at all. Incentives could push those systems to get smarter as the autonomous arms race heats up, and if that happens in an unrestricted way it could spell trouble along the lines discussed in this paper, but incentives are also strongly aligned for any group to build weapons systems that can't be turned on their handlers, whether by human hackers or by their own volition. Theories about optimal policies have no bearing if there's some encrypted remote kill command (sketched loosely below), or if, as we come to understand ML models better, we can do things like hardware-block policies that lead to certain predicted outcome sequences (blocking an off switch, harming a human, etc.). Ultimately, the dynamics here seem like an evolution of the existing problems of human force and violence rather than a qualitative jump to a different sort of threat entirely.
This line of research sounds similar to causal entropic forcing [1]: the idea that you can get intelligent-seeming behavior from an agent that maximizes the future entropy of states in some non-deterministic system.

[1] https://www.alexwg.org/publications/PhysRevLett_110-168702.pdf
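For the flavor of that idea without reading the paper, here is a rough toy sketch of my own construction (not the authors' code): in a bounded 1D random walk, estimate the entropy of reachable future states for each candidate action by Monte Carlo rollouts, and pick the action that keeps the most future options open. The environment and parameters are invented for illustration.

```python
# Toy sketch of causal-entropy-style action selection on a bounded 1D walk.
# Environment, horizon, and sample counts are made up for illustration.
import math
import random
from collections import Counter

def step(state: int, action: int, size: int = 20) -> int:
    """Move left (-1) or right (+1) on a line of `size` cells; walls clip the move."""
    return max(0, min(size - 1, state + action))

def rollout_entropy(state: int, horizon: int = 10, samples: int = 500) -> float:
    """Monte Carlo estimate of the entropy over end states reachable from `state`."""
    ends = Counter()
    for _ in range(samples):
        s = state
        for _ in range(horizon):
            s = step(s, random.choice((-1, +1)))
        ends[s] += 1
    total = sum(ends.values())
    return -sum((c / total) * math.log(c / total) for c in ends.values())

def entropic_action(state: int) -> int:
    """Choose the action whose successor state has the highest estimated future entropy."""
    return max((-1, +1), key=lambda a: rollout_entropy(step(state, a)))

# Starting next to a wall, the agent tends to drift toward the open middle,
# where more future states remain reachable.
s = 1
for _ in range(5):
    s = step(s, entropic_action(s))
print("final state:", s)
```

The connection to the power-seeking results is loose but suggestive: both frame "keeping options open" as something an optimizing agent tends toward without being explicitly told to.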