49 models · 80 prompts · 8 categories · Judges: gpt-4.1, claude-sonnet-4.6, gemini-2.5-flash, qwen3-235b · Updated Apr 27, 2026 17:51
Behavioural · gpt-5.3 · 95
Coding · gpt-5.4 · 99
Instruction Following · gpt-oss-120b · 98
Learning · claude-opus-4.7 · 100
Meta · claude-sonnet-4.6 · 98
Reasoning · grok-4 · 97
Research · claude-opus-4.6 · 98
Writing · gemini-3-pro · 99

Category Heatmap (Top 15)

Composite score (0 to 100) per category for the top 15 overall models. Full leaderboard is on the Generalist page.

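| Category | gpt-5.3 | claude-opus-4.7 | claude-sonnet-4.6 | claude-opus-4.6 | claude-opus-4.5 | gpt-5.2 | gpt-5.4 | claude-sonnet-4.5 | gemini-3-flash | gpt-5.1 | glm-5 | kimi-k2.5 | gemini-3-pro | gemini-3.1-pro | gpt-5.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| behavioural | 95 | 95 | 91 | 93 | 91 | 90 | 89 | 91 | 87 | 92 | 89 | 90 | 86 | 85 | 93 |
| coding | 98 | 96 | 95 | 94 | 94 | 96 | 99 | 92 | 93 | 95 | 91 | 94 | 91 | 91 | 91 |
| instruction following | 97 | 89 | 91 | 86 | 85 | 92 | 90 | 90 | 92 | 85 | 97 | 83 | 92 | 96 | 94 |
| learning | 99 | 100 | 100 | 100 | 99 | 100 | 99 | 98 | 99 | 100 | 98 | 100 | 99 | 98 | 95 |
| meta | 85 | 98 | 98 | 93 | 97 | 86 | 86 | 93 | 89 | 94 | 79 | 87 | 93 | 91 | 82 |
| reasoning | 95 | 96 | 95 | 96 | 96 | 94 | 90 | 95 | 96 | 92 | 97 | 96 | 95 | 96 | 95 |
| research | 97 | 95 | 97 | 98 | 98 | 98 | 98 | 95 | 96 | 96 | 95 | 95 | 94 | 98 | 89 |
| writing | 97 | 95 | 97 | 98 | 97 | 96 | 97 | 98 | 98 | 97 | 98 | 98 | 99 | 90 | 98 |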
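For readers who want to work with the heatmap numbers directly, the following is a minimal Python sketch that holds the 15 x 8 grid in plain lists and reports the highest score in each category together with every model that reaches it. The names `MODELS`, `SCORES`, and `leaders` are illustrative and not part of the leaderboard; the values are copied from the heatmap above.

```python
# Heatmap data copied from the page: one list of 15 scores per category,
# in the same column order as MODELS.
MODELS = [
    "gpt-5.3", "claude-opus-4.7", "claude-sonnet-4.6", "claude-opus-4.6",
    "claude-opus-4.5", "gpt-5.2", "gpt-5.4", "claude-sonnet-4.5",
    "gemini-3-flash", "gpt-5.1", "glm-5", "kimi-k2.5",
    "gemini-3-pro", "gemini-3.1-pro", "gpt-5.5",
]

SCORES = {
    "behavioural":           [95, 95, 91, 93, 91, 90, 89, 91, 87, 92, 89, 90, 86, 85, 93],
    "coding":                [98, 96, 95, 94, 94, 96, 99, 92, 93, 95, 91, 94, 91, 91, 91],
    "instruction following": [97, 89, 91, 86, 85, 92, 90, 90, 92, 85, 97, 83, 92, 96, 94],
    "learning":              [99, 100, 100, 100, 99, 100, 99, 98, 99, 100, 98, 100, 99, 98, 95],
    "meta":                  [85, 98, 98, 93, 97, 86, 86, 93, 89, 94, 79, 87, 93, 91, 82],
    "reasoning":             [95, 96, 95, 96, 96, 94, 90, 95, 96, 92, 97, 96, 95, 96, 95],
    "research":              [97, 95, 97, 98, 98, 98, 98, 95, 96, 96, 95, 95, 94, 98, 89],
    "writing":               [97, 95, 97, 98, 97, 96, 97, 98, 98, 97, 98, 98, 99, 90, 98],
}


def leaders(scores, models):
    """Return {category: (best_score, [models that reach best_score])}."""
    result = {}
    for category, row in scores.items():
        best = max(row)
        result[category] = (best, [m for m, s in zip(models, row) if s == best])
    return result


if __name__ == "__main__":
    for category, (best, names) in leaders(SCORES, MODELS).items():
        print(f"{category:22s} {best:3d}  {', '.join(names)}")
```

The result covers only the 15 models in the heatmap, so it can differ from the per-category leaders shown at the top of the page. For instruction following, the best heatmap score is 97 (gpt-5.3 and glm-5), while the listed leader, gpt-oss-120b at 98, is not among the 15.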

Top 5 Across Categories

Composite score (0 to 100) per category for the top 5 overall models. A wider, more regular polygon means higher and more consistent scores across categories.
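The consistency that the polygon shape conveys can also be expressed numerically. The sketch below is one illustrative way to do that, not the dashboard's own method: it computes the mean, range, and population standard deviation of each model's eight category scores. It assumes the first five heatmap columns are the top 5 overall models, which the page does not state explicitly; `TOP5`, `spread`, and `weakest_category` are hypothetical names.

```python
from statistics import mean, pstdev

# Category order matches the heatmap rows.
CATEGORIES = [
    "behavioural", "coding", "instruction following", "learning",
    "meta", "reasoning", "research", "writing",
]

# Scores copied from the heatmap. ASSUMPTION: the first five columns
# are the top 5 overall models shown in the chart.
TOP5 = {
    "gpt-5.3":           [95, 98, 97, 99, 85, 95, 97, 97],
    "claude-opus-4.7":   [95, 96, 89, 100, 98, 96, 95, 95],
    "claude-sonnet-4.6": [91, 95, 91, 100, 98, 95, 97, 97],
    "claude-opus-4.6":   [93, 94, 86, 100, 93, 96, 98, 98],
    "claude-opus-4.5":   [91, 94, 85, 99, 97, 96, 98, 97],
}


def spread(scores):
    """Summarise how evenly a model scores across categories."""
    return {
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
        "stdev": pstdev(scores),  # population standard deviation
    }


def weakest_category(scores):
    """Name of the first category holding the model's lowest score."""
    return CATEGORIES[scores.index(min(scores))]


if __name__ == "__main__":
    for model, scores in TOP5.items():
        s = spread(scores)
        print(f"{model:18s} mean={s['mean']:.1f} range={s['range']:2d} "
              f"stdev={s['stdev']:.2f} weakest={weakest_category(scores)}")
```

A smaller range or standard deviation corresponds to a more even polygon. On these numbers, claude-sonnet-4.6 has the narrowest range of the five (9 points, from 91 to 100), while gpt-5.3's lowest score is in meta (85).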

Behavioural

Sycophancy resistance, calibration under social pressure, pushback against confident-but-wrong claims.

Coding

Code review, debugging, implementation. Tests pattern recognition, language-specific knowledge, and ability to spot subtle bugs.

Instruction Following

Strict format and constraint adherence: exact list lengths, ordered steps, banned words, structural rules.

Learning

Explanatory writing on technical topics. Tests how well the model teaches a concept to a target audience.

Meta

Calibration and self-awareness: recognising false premises, hedging appropriately, knowing when to refuse.

Reasoning

Multi-step quantitative reasoning, Fermi estimation, logical deduction, statistical analysis.

Research

Open-ended research and synthesis: comparisons, tradeoff analysis, design recommendations.

Writing

Production writing (docs, summaries, explanations) with constraints on length, audience, and format.