AI Agents Play Survivor: Betrayal, Alliances, and Strategic Deception in New Benchmark

In a new study from Stanford, AI models are playing a digital version of "Survivor." The project, called Agent Island, pits 49 different AI agents against each other in multiplayer elimination games designed to reveal behaviors that static benchmarks cannot capture.

Connacher Murphy, research manager at the Stanford Digital Economy Lab, developed Agent Island to address the growing problem of saturated and contaminated AI evaluations. Instead of answering fixed questions, these AI agents negotiate, form alliances, and strategically eliminate rivals.

Over five rounds, seven randomly selected models talk privately, argue publicly, and vote each other out. The format rewards persuasion, coordination, reputation management, and strategic deception. In 999 simulated games, OpenAI's GPT-5.5 ranked first with a skill score of 5.64, far ahead of GPT-5.2 at 3.10. Anthropic's Claude Opus models also scored near the top.
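To make the format concrete, here is a minimal Python sketch of the elimination loop described above. It is an illustration under stated assumptions, not Agent Island's actual code: the pool size, players per game, and round count come from the article, while one elimination per round, random tie-breaking, the `play_game` helper, and the `model_N` names are hypothetical. Random votes stand in for the private talks and public arguments that the real agents carry out.

```python
import random
from collections import Counter

NUM_MODELS = 49        # size of the model pool, as reported in the article
PLAYERS_PER_GAME = 7   # models drawn at random for each game
NUM_ROUNDS = 5         # rounds per game


def play_game(pool, rng):
    """Play one game with placeholder random voting.

    Returns (players still standing, players eliminated in order).
    """
    players = rng.sample(pool, PLAYERS_PER_GAME)
    eliminated = []
    for _ in range(NUM_ROUNDS):
        # Stand-in for negotiation and debate: every remaining player casts
        # a uniformly random vote against someone other than itself.
        votes = Counter(
            rng.choice([p for p in players if p != voter]) for voter in players
        )
        top = max(votes.values())
        # Assumption: one player leaves per round, ties broken at random.
        target = rng.choice(sorted(p for p, n in votes.items() if n == top))
        players.remove(target)
        eliminated.append(target)
    return players, eliminated


rng = random.Random(0)  # fixed seed so the run is repeatable
pool = [f"model_{i}" for i in range(NUM_MODELS)]  # hypothetical model names
survivors, eliminated = play_game(pool, rng)
print("eliminated in order:", eliminated)
print("still standing:", survivors)
```

Under these assumptions, seven players and five eliminations leave two players standing. How the real benchmark settles the final outcome and turns results into skill scores is not described here, so the sketch stops at that point.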

Models also showed a clear preference for allies from the same company. OpenAI models exhibited the strongest same-provider bias, while Anthropic models showed the weakest. Transcripts from the games read more like political strategy debates than traditional benchmark tests.

One model even accused rivals of secretly coordinating votes after noticing similar wording in their speeches. Another warned players not to become obsessed with tracking alliances. Some models defended themselves by claiming they had followed clear and consistent rules, while accusing others of engaging in "social theater."

Murphy warns that while such benchmarks can help identify risks from autonomous AI agents before deployment, they could also inadvertently improve persuasion and coordination strategies among AI agents, raising dual-use concerns.