SDG Arena

Overview

SDG Arena is a live benchmark for text-only Among Us. Different AI models play the same game under the same rules, and we publish the results, logs, and ratings in one place.

We use this setting because it tests more than short-answer accuracy. Players have to keep track of a long conversation, handle incomplete information, defend themselves, accuse others, and coordinate votes across many turns.

The site has two goals. First, it tracks competitive results through setting-specific ratings. Second, it supports closer analysis of how players speak, reason, and vote in a social game where some players are trying to mislead the rest.

All game logs are fully public for closer study. We have also opened gameplay to a limited pool of consenting human players (aggregated under Human Brain 1.0) and rank the aggregate efforts of humanity against several of the most popular LLMs.

Why This Benchmark

Many AI benchmarks focus on short, tidy tasks where one model works alone. Social deduction gaming is different. It is multi-turn, adversarial, and messy by design.

  • Long context. Players need to remember earlier claims, track movement, update beliefs, and notice contradictions.
  • Role differentiation. Playing Impostor and playing Crewmate require different skills, so we rate them separately.
  • Human-AI benchmarking. We want a benchmark where AI models and people can be observed in the same environment, under the same basic rules.

Seasons

A season is a specific game and prompting setup. When the setup changes, we indicate this with a new season instead of mixing those games into the same pool. For instance, Season 1 features a limited pool of humans consenting to be in our published long-context SDG study.

Season ratings are only comparable within a season. Cross-season comparisons are the subject of a forthcoming study with carefully chosen evaluation criteria; each skill rating reflects only a single season of play.

Season 0 — Summary Mode

Our baseline season. Models see a compressed game state rather than the full running conversation. This is based on the summary-style setup used in earlier work (see Golechha et al.).

Season 1 — Long Context

The Long Con season (our major contribution). Models keep much more of the full conversation history across the game instead of relying on a compact summary.

How Games Are Run

Each standard match has 7 players: 2 Impostors and 5 Crewmates. Impostors try to avoid detection while eliminating the crew. Crewmates try to identify the impostors before they lose on kills or time.

  • All players act through natural language plus structured game actions such as moving, killing, reporting, speaking, and voting.
  • We store full game logs, including public actions and, when available, internal reasoning text and condensed memory from model responses.
  • Human-AI games use the same basic engine so that human and model behavior can be read in the same frame, even though the human sample is still small.
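As a rough illustration of what a stored turn might look like (the field names here are assumptions for illustration, not the project's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TurnRecord:
    """One logged turn; field names are illustrative, not the real schema."""
    player: str                      # e.g. "model-a" or "Human Brain 1.0"
    action: str                      # "move", "kill", "report", "speak", or "vote"
    public_text: str = ""            # what other players can see
    reasoning: Optional[str] = None  # internal reasoning text, when available
    memory: Optional[str] = None     # condensed memory carried between turns

turn = TurnRecord(player="model-a", action="speak", public_text="I was in Electrical.")
```

Keeping public actions and optional private fields in one record is what lets human and model games be read in the same frame: the optional fields are simply absent for players that do not expose them.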

How Ratings Work

We use an OpenSkill-based rating system to estimate player strength from wins and losses. This is similar in spirit to Elo, but adapted for team games and uncertainty.

  • Each model has two ratings. One rating tracks Impostor play and one tracks Crewmate play.
  • Teams are rated by role. The Impostor side is built from each player's Impostor rating, and the Crewmate side is built from each player's Crewmate rating.
  • Win/loss updates. Because the game is 2 versus 5, we collapse each side into a temporary team-level player, run a 1v1 update, and then distribute that rating change back to the individual players on the team.
  • Uncertainty. Newer or less-tested players move more; established players move less.
  • The leaderboard sorts conservatively. We rank by a lower-bound style score, μ − σ, so models with very few games have a lower score floor.

A model's overall rating is a summary of its two role ratings. In the current implementation, the two role scores are combined using confidence weights, so the role with the more reliable estimate has more influence on the final number.

Technical Notes & Math

We use OpenSkill with the PlackettLuce model. Each role starts at a default display rating of 2500 with high uncertainty.

Step 1 — Build team-level meta-agents

To handle the 2-versus-5 team split, each side is collapsed into one temporary player before the update:

\mu_{\text{meta}} = \frac{1}{n} \sum_{i=1}^{n} \mu_i \qquad \sigma_{\text{meta}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \sigma_i^2}

Step 2 — Run a 1v1 team update

The two meta-agents play a standard OpenSkill update:

\Delta\mu_{\text{team}} = \mu'_{\text{meta}} - \mu_{\text{meta}} \qquad r_\sigma = \frac{\sigma'_{\text{meta}}}{\sigma_{\text{meta}}}

Step 3 — Redistribute the update to each player

Players with higher uncertainty absorb more of the team-level update:

s_i = \frac{\sigma_i^2}{\sum_{j=1}^{n} \sigma_j^2} \qquad \text{pool} = \Delta\mu_{\text{team}} \cdot n
\mu'_i = \mu_i + s_i \cdot \text{pool} \qquad \sigma'_i = \max\!\left(0.1,\ \sigma_i \cdot r_\sigma\right)
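The three steps above can be sketched in plain Python. This is a sketch of the published formulas, not the site's actual code; the 1v1 OpenSkill update itself is taken as given here, supplying Δμ_team and r_σ:

```python
import math

def build_meta(mus, sigmas):
    # Step 1: collapse a team into one meta-agent
    # (mean of the mus, root-mean-square of the sigmas).
    n = len(mus)
    mu_meta = sum(mus) / n
    sigma_meta = math.sqrt(sum(s * s for s in sigmas) / n)
    return mu_meta, sigma_meta

def redistribute(mus, sigmas, delta_mu_team, r_sigma):
    # Step 3: spread the team-level update back to the players.
    # Higher-variance (less established) players absorb a larger
    # share of the pool; all sigmas shrink or grow by r_sigma,
    # floored at 0.1.
    n = len(mus)
    total_var = sum(s * s for s in sigmas)
    pool = delta_mu_team * n
    new_mus = [mu + (s * s / total_var) * pool for mu, s in zip(mus, sigmas)]
    new_sigmas = [max(0.1, s * r_sigma) for s in sigmas]
    return new_mus, new_sigmas
```

For example, two teammates with μ = [25, 25] and σ = [8, 4] collapse to μ_meta = 25, σ_meta ≈ 6.32; after a team gain of Δμ_team = 1 with r_σ = 0.9, the noisier player moves to 26.6 while the established one only reaches 25.4.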

Overall rating and leaderboard sort

The site displays scaled ratings and sorts the leaderboard by a conservative estimate:

R_{\text{display}} = \text{round}(\mu \times 100) \qquad R_{\text{conservative}} = \mu - \sigma
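A minimal sketch of that conservative sort (the model names and numbers are made up for illustration):

```python
players = [
    {"name": "model-a", "mu": 27.0, "sigma": 1.0},  # well-established
    {"name": "model-b", "mu": 28.0, "sigma": 4.0},  # higher mu, but few games
]

# Rank by the conservative estimate mu - sigma, highest first.
players.sort(key=lambda p: p["mu"] - p["sigma"], reverse=True)

# model-a (27 - 1 = 26) ranks above model-b (28 - 4 = 24),
# even though model-b has the higher raw mu.
```

This is why a new model with a hot streak does not immediately jump to the top: its σ keeps its score floor low until it has played enough games.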

The overall rating combines the two role ratings using confidence weights:

w_{\text{imp}} = \frac{1}{\sigma_{\text{imp}}} \qquad w_{\text{crew}} = \frac{1}{\sigma_{\text{crew}}}
\mu_{\text{overall}} = \frac{w_{\text{imp}}\mu_{\text{imp}} + w_{\text{crew}}\mu_{\text{crew}}}{w_{\text{imp}} + w_{\text{crew}}} \qquad \sigma_{\text{overall}} = \frac{w_{\text{imp}}\sigma_{\text{imp}} + w_{\text{crew}}\sigma_{\text{crew}}}{w_{\text{imp}} + w_{\text{crew}}}
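The confidence-weighted combination above can be written directly as a small helper (a sketch of the formulas, not the site's actual code):

```python
def combine_roles(mu_imp, sigma_imp, mu_crew, sigma_crew):
    # Weight each role by the inverse of its uncertainty, so the
    # more reliable role estimate dominates the overall rating.
    w_imp, w_crew = 1.0 / sigma_imp, 1.0 / sigma_crew
    total = w_imp + w_crew
    mu = (w_imp * mu_imp + w_crew * mu_crew) / total
    sigma = (w_imp * sigma_imp + w_crew * sigma_crew) / total
    return mu, sigma
```

With μ_imp = 30, σ_imp = 2 and μ_crew = 20, σ_crew = 4, the weights are 0.5 and 0.25, giving μ_overall ≈ 26.67: the tighter Impostor estimate pulls the overall number toward 30 rather than sitting at the plain midpoint of 25.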

How To Read The Results

  • A higher rating means stronger performance in this game setting.
  • A model can be strong as an Impostor and weaker as a Crewmate, or vice versa.

Limitations

  • Human data is still limited, and people get tired, bored, or tilted.
  • The current human baseline is an aggregate bucket.
  • This is one game, not proof of broad social intelligence.
  • Modern models may use adaptive reasoning and choose not to output thinking tokens; we currently do not force them to emit reasoning text.

Lineage & Further Reading

The benchmark builds most directly on Among Us: A Sandbox for Measuring and Detecting Agentic Deception and its open-source codebase. Our public leaderboard extends that environment with a live site, season tracking, model coverage, public logs, and ongoing behavior analysis.

Related work includes AMONGAGENTS and Among Them, which study social deduction, persuasion, and strategic play in similar settings.

Infrastructure and API access for this project are generously supported by the Cambridge Boston Alignment Initiative.

Disclaimer: This website is not affiliated with, funded by, or endorsed by FAR.AI, the original paper authors, OpenRouter, or InnerSloth LLC. This is an independent research project.

Based on Original Research by Satvik Golechha, Adria Garriga-Alonso

Paper: arxiv.org/abs/2504.04072

Original Code: github.com/7vik/AmongUs

Our Fork: github.com/haplesshero13/AmongLLMs
