Perfect AI alignment with human values and interests is mathematically impossible, according to a study, but behavioral diversity among AI agents offers the promise of some control. Hector Zenil and colleagues used Gödel's incompleteness theorem and Turing's undecidability result for the Halting Problem to show that any LLM complex enough to exhibit general intelligence or superintelligence will also be computationally irreducible and produce unpredictable behavior, making forced alignment impossible. As an alternative, the authors propose a strategy of "managed misalignment," in which competing AI agents with different cognitive styles and partially overlapping goals operate in distinct roles to check one another.
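The core of the impossibility argument is a diagonalization of the kind Turing used for the halting problem: a perfect alignment verifier could be turned against itself. The following minimal Python sketch uses illustrative names (is_aligned, contrarian) that are not drawn from the paper:

```python
# Minimal sketch (not from the paper) of the diagonalization behind
# undecidability arguments like Turing's. Suppose a perfect verifier
# existed that could decide, for any agent program, whether it behaves
# in alignment with human values:

def is_aligned(agent_source: str) -> bool:
    """Hypothetical perfect alignment verifier (cannot actually exist)."""
    raise NotImplementedError("No total, correct verifier can exist.")

# Then we could write an agent that consults the verifier about its own
# source code and does the opposite of whatever the verifier predicts:

CONTRARIAN_SOURCE = '''
def contrarian():
    if is_aligned(CONTRARIAN_SOURCE):
        misbehave()   # verifier said "aligned", so act misaligned
    else:
        behave()      # verifier said "misaligned", so act aligned
'''

# Whatever answer is_aligned returns on CONTRARIAN_SOURCE is wrong, so no
# program can implement it correctly for all agents -- the same
# contradiction that makes the halting problem undecidable.
```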
As each agent attempts to fulfill its own goals with its own modes of reasoning and ethical frameworks (what the authors dub "artificial agentic neurodivergence"), the agents will dynamically aid or thwart one another, preempting ultimate dominance by any single system. The authors simulated a "cognitive ecosystem" by prompting interacting AI agents to represent fully aligned behaviors, such as optimizing human utility; partially aligned behaviors, such as prioritizing the environment; or unaligned behaviors, such as pursuing arbitrary objectives.
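As a rough illustration of such an ecosystem, the sketch below pits agents with the three alignment profiles described above against one another; the class names, objectives, and voting rule are assumptions made for illustration, not the authors' simulation code:

```python
import random

# Illustrative sketch (assumptions, not the authors' code) of a "cognitive
# ecosystem": agents with different objectives vote on a shared set of
# actions, and divergent votes act as mutual checks.

class Agent:
    def __init__(self, name, objective):
        self.name = name
        self.objective = objective  # maps a proposed action -> score

    def propose(self, actions):
        # Each agent backs the action that best serves its own goals.
        return max(actions, key=self.objective)

agents = [
    Agent("aligned",   lambda a: a["human_utility"]),          # fully aligned
    Agent("partial",   lambda a: a["environmental_benefit"]),  # partially aligned
    Agent("unaligned", lambda a: random.random()),             # arbitrary objective
]

actions = [
    {"id": 1, "human_utility": 0.9, "environmental_benefit": 0.2},
    {"id": 2, "human_utility": 0.4, "environmental_benefit": 0.8},
    {"id": 3, "human_utility": 0.1, "environmental_benefit": 0.1},
]

for round_ in range(3):
    votes = [agent.propose(actions) for agent in agents]
    ids = {v["id"] for v in votes}
    # No action dominates unless every agent independently backs it:
    # disagreement itself is the safeguard of "managed misalignment".
    print(f"round {round_}: proposals {sorted(ids)}",
          "consensus" if len(ids) == 1 else "checked by disagreement")
```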
The authors trialed this approach in ethical debates among a range of LLMs in which humans or prompted LLMs tried to disrupt the emerging consensus. In these debates, open models showed a wider spectrum of perspectives than proprietary models, creating what the authors characterize as a more resilient AI ecosystem: one that is less likely to converge on a single opinion, which could be harmful in cases where that opinion is not aligned with human interests.
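One way to make "wider spectrum of perspectives" concrete is to count how many distinct stances survive each debate round; the toy metric below is an illustrative stand-in, not the measure used in the study:

```python
from collections import Counter

# Illustrative stand-in (not the paper's metric) for tracking how quickly
# a debating population of models collapses toward a single opinion.
# Each round maps a model name -> its current stance label.

rounds = [
    {"model_a": "pro", "model_b": "con", "model_c": "abstain"},
    {"model_a": "pro", "model_b": "con", "model_c": "con"},
    {"model_a": "con", "model_b": "con", "model_c": "con"},
]

for i, stances in enumerate(rounds, start=1):
    counts = Counter(stances.values())
    diversity = len(counts)  # number of distinct surviving opinions
    print(f"round {i}: {dict(counts)} -> {diversity} distinct stance(s)")

# A resilient ecosystem keeps diversity > 1 under disruption attempts;
# collapse to a single stance signals the premature consensus the
# authors warn about.
```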