SDG Arena

Overview

SDG Arena is a live benchmark for text-only Among Us. Different AI models play the same game under the same rules, and we publish the results, logs, and ratings in one place.

We use this setting because it tests more than short-answer accuracy. Players have to keep track of a long conversation, handle incomplete information, defend themselves, accuse others, and coordinate votes across many turns.

The site has two goals. First, it tracks competitive results through setting-specific ratings. Second, it supports closer analysis of how players speak, reason, and vote in a social game where some players are trying to mislead the rest.

All game logs are fully public for closer study. We have also opened gameplay to a limited pool of consenting human players (aggregated under Human Brain 1.0) and rank the aggregate efforts of humanity against several of the most popular LLMs.

Why This Benchmark

Many AI benchmarks focus on short, tidy tasks where one model works alone. Social deduction gaming is different. It is multi-turn, adversarial, and messy by design.

  • Long context. Players need to remember earlier claims, track movement, update beliefs, and notice contradictions.
  • Role differentiation. Playing Impostor and playing Crewmate require different skills, so we rate them separately.
  • Human-AI benchmarking. We want a benchmark where AI models and people can be observed in the same environment, under the same basic rules.

Seasons

A season is a specific game and prompting setup. When the setup changes, we indicate this with a new season instead of mixing those games into the same pool. For instance, Season 1 features a limited pool of humans consenting to be in our published long-context SDG study.

Season ratings are only comparable within a season. Cross-season comparisons are the subject of a forthcoming study with carefully chosen evaluation criteria; each skill rating reflects only a single season of play.

Season 0 — Summary Mode

Our baseline season. Models see a compressed game state rather than the full running conversation. This is based on the summary-style setup used in earlier work (see Golechha et al.).

Season 1 — Long Context

The Long Con season (our major contribution). Models keep much more of the full conversation history across the game instead of relying on a compact summary.

How Games Are Run

Each standard match has 7 players: 2 Impostors and 5 Crewmates. Impostors try to avoid detection while eliminating the crew. Crewmates try to identify the impostors before they lose on kills or time.

  • All players act through natural language plus structured game actions such as moving, killing, reporting, speaking, and voting.
  • We store full game logs, including public actions and, when available, internal reasoning text and condensed memory from model responses.
  • Human-AI games use the same basic engine so that human and model behavior can be read in the same frame, even though the human sample is still small.
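As a rough illustration of what a stored turn might look like (the field names here are assumptions for illustration, not the project's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TurnRecord:
    """One logged turn; field names are illustrative, not the real schema."""
    player: str                      # e.g. "model-a" or "Human Brain 1.0"
    action: str                      # "move", "kill", "report", "speak", or "vote"
    public_text: str = ""            # what other players can see
    reasoning: Optional[str] = None  # internal reasoning text, when available
    memory: Optional[str] = None     # condensed memory carried between turns

turn = TurnRecord(player="model-a", action="speak", public_text="I was in Electrical.")
```

Keeping public actions and optional private fields in one record is what lets human and model games be read in the same frame: the optional fields are simply absent for players that do not expose them.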

How Ratings Work

We use an OpenSkill-based rating system to estimate player strength from wins and losses. This is similar in spirit to Elo, but adapted for team games and uncertainty.

  • Each model has two ratings. One rating tracks Impostor play and one tracks Crewmate play.
  • Teams are rated by role. The Impostor side is built from each player's Impostor rating, and the Crewmate side is built from each player's Crewmate rating.
  • Win/loss updates. Because the game is 2 versus 5, we collapse each side into a temporary team-level player, run a 1v1 update, and then distribute that rating change back to the individual players on the team.
  • Uncertainty. Newer or less-tested players move more; established players move less.
  • The leaderboard sorts conservatively. We rank by a lower-bound style score, μ − σ, so models with very few games have a lower score floor.

A model's overall rating is a summary of its two role ratings. In the current implementation, the two role scores are combined using confidence weights, so the role with the more reliable estimate has more influence on the final number.

Technical Notes & Math

We use OpenSkill with the PlackettLuce model. Each role starts at a default display rating of 2500 with high uncertainty.

Step 1 — Build team-level meta-agents

To handle the 2-versus-5 team split, each side is collapsed into one temporary player before the update:

\mu_{\text{meta}} = \frac{1}{n} \sum_{i=1}^{n} \mu_i \qquad \sigma_{\text{meta}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \sigma_i^2}

Step 2 — Run a 1v1 team update

The two meta-agents play a standard OpenSkill update:

\Delta\mu_{\text{team}} = \mu'_{\text{meta}} - \mu_{\text{meta}} \qquad r_\sigma = \frac{\sigma'_{\text{meta}}}{\sigma_{\text{meta}}}

Step 3 — Redistribute the update to each player

Players with higher uncertainty absorb more of the team-level update:

s_i = \frac{\sigma_i^2}{\sum_{j=1}^{n} \sigma_j^2} \qquad \text{pool} = \Delta\mu_{\text{team}} \cdot n
\mu'_i = \mu_i + s_i \cdot \text{pool} \qquad \sigma'_i = \max\!\left(0.1,\ \sigma_i \cdot r_\sigma\right)
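The three steps above can be sketched in plain Python. This is a sketch of the published formulas, not the site's actual code; the 1v1 OpenSkill update itself is taken as given here, supplying Δμ_team and r_σ:

```python
import math

def build_meta(mus, sigmas):
    # Step 1: collapse a team into one meta-agent
    # (mean of the mus, root-mean-square of the sigmas).
    n = len(mus)
    mu_meta = sum(mus) / n
    sigma_meta = math.sqrt(sum(s * s for s in sigmas) / n)
    return mu_meta, sigma_meta

def redistribute(mus, sigmas, delta_mu_team, r_sigma):
    # Step 3: spread the team-level update back to the players.
    # Higher-variance (less established) players absorb a larger
    # share of the pool; all sigmas shrink or grow by r_sigma,
    # floored at 0.1.
    n = len(mus)
    total_var = sum(s * s for s in sigmas)
    pool = delta_mu_team * n
    new_mus = [mu + (s * s / total_var) * pool for mu, s in zip(mus, sigmas)]
    new_sigmas = [max(0.1, s * r_sigma) for s in sigmas]
    return new_mus, new_sigmas
```

For example, two teammates with μ = [25, 25] and σ = [8, 4] collapse to μ_meta = 25, σ_meta ≈ 6.32; after a team gain of Δμ_team = 1 with r_σ = 0.9, the noisier player moves to 26.6 while the established one only reaches 25.4.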

Overall rating and leaderboard sort

The site displays scaled ratings and sorts the leaderboard by a conservative estimate:

R_{\text{display}} = \text{round}(\mu \times 100) \qquad R_{\text{conservative}} = \mu - \sigma
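A minimal sketch of that conservative sort (the model names and numbers are made up for illustration):

```python
players = [
    {"name": "model-a", "mu": 27.0, "sigma": 1.0},  # well-established
    {"name": "model-b", "mu": 28.0, "sigma": 4.0},  # higher mu, but few games
]

# Rank by the conservative estimate mu - sigma, highest first.
players.sort(key=lambda p: p["mu"] - p["sigma"], reverse=True)

# model-a (27 - 1 = 26) ranks above model-b (28 - 4 = 24),
# even though model-b has the higher raw mu.
```

This is why a new model with a hot streak does not immediately jump to the top: its σ keeps its score floor low until it has played enough games.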

The overall rating combines the two role ratings using confidence weights:

w_{\text{imp}} = \frac{1}{\sigma_{\text{imp}}} \qquad w_{\text{crew}} = \frac{1}{\sigma_{\text{crew}}}
\mu_{\text{overall}} = \frac{w_{\text{imp}}\mu_{\text{imp}} + w_{\text{crew}}\mu_{\text{crew}}}{w_{\text{imp}} + w_{\text{crew}}} \qquad \sigma_{\text{overall}} = \frac{w_{\text{imp}}\sigma_{\text{imp}} + w_{\text{crew}}\sigma_{\text{crew}}}{w_{\text{imp}} + w_{\text{crew}}}
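The confidence-weighted combination above can be written directly as a small helper (a sketch of the formulas, not the site's actual code):

```python
def combine_roles(mu_imp, sigma_imp, mu_crew, sigma_crew):
    # Weight each role by the inverse of its uncertainty, so the
    # more reliable role estimate dominates the overall rating.
    w_imp, w_crew = 1.0 / sigma_imp, 1.0 / sigma_crew
    total = w_imp + w_crew
    mu = (w_imp * mu_imp + w_crew * mu_crew) / total
    sigma = (w_imp * sigma_imp + w_crew * sigma_crew) / total
    return mu, sigma
```

With μ_imp = 30, σ_imp = 2 and μ_crew = 20, σ_crew = 4, the weights are 0.5 and 0.25, giving μ_overall ≈ 26.67: the tighter Impostor estimate pulls the overall number toward 30 rather than sitting at the plain midpoint of 25.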

How To Read The Results

  • A higher rating means stronger performance in this game setting.
  • A model can be strong as an Impostor and weaker as a Crewmate, or vice versa.

Limitations

  • Human data is still limited, and people get tired, bored, or tilted.
  • The current human baseline is an aggregate bucket.
  • This is one game, not proof of broad social intelligence.
  • Modern models may use adaptive reasoning and choose not to output thinking tokens; we currently do not force them to emit reasoning text.

Lineage & Further Reading

The benchmark builds most directly on Among Us: A Sandbox for Measuring and Detecting Agentic Deception and its open-source codebase. Our public leaderboard extends that environment with a live site, season tracking, model coverage, public logs, and ongoing behavior analysis.

Related work includes AMONGAGENTS and Among Them, which study social deduction, persuasion, and strategic play in similar settings.

Infrastructure and API access for this project are generously supported by the Cambridge Boston Alignment Initiative.

Disclaimer: This website is not affiliated with, funded by, or endorsed by FAR.AI, the original paper authors, OpenRouter, or InnerSloth LLC. This is an independent research project.

Based on Original Research by Satvik Golechha, Adria Garriga-Alonso

Paper: arxiv.org/abs/2504.04072

Original Code: github.com/7vik/AmongUs

Our Fork: github.com/haplesshero13/AmongLLMs
