Composite score (0 to 100) per category for the top 15 overall models. Full leaderboard is on the Generalist page.
| Category | gpt-5.3 | claude-opus-4.7 | claude-sonnet-4.6 | claude-opus-4.6 | claude-opus-4.5 | gpt-5.2 | gpt-5.4 | claude-sonnet-4.5 | gemini-3-flash | gpt-5.1 | glm-5 | kimi-k2.5 | gemini-3-pro | gemini-3.1-pro | gpt-5.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| behavioural | 95 | 95 | 91 | 93 | 91 | 90 | 89 | 91 | 87 | 92 | 89 | 90 | 86 | 85 | 93 |
| coding | 98 | 96 | 95 | 94 | 94 | 96 | 99 | 92 | 93 | 95 | 91 | 94 | 91 | 91 | 91 |
| instruction following | 97 | 89 | 91 | 86 | 85 | 92 | 90 | 90 | 92 | 85 | 97 | 83 | 92 | 96 | 94 |
| learning | 99 | 100 | 100 | 100 | 99 | 100 | 99 | 98 | 99 | 100 | 98 | 100 | 99 | 98 | 95 |
| meta | 85 | 98 | 98 | 93 | 97 | 86 | 86 | 93 | 89 | 94 | 79 | 87 | 93 | 91 | 82 |
| reasoning | 95 | 96 | 95 | 96 | 96 | 94 | 90 | 95 | 96 | 92 | 97 | 96 | 95 | 96 | 95 |
| research | 97 | 95 | 97 | 98 | 98 | 98 | 98 | 95 | 96 | 96 | 95 | 95 | 94 | 98 | 89 |
| writing | 97 | 95 | 97 | 98 | 97 | 96 | 97 | 98 | 98 | 97 | 98 | 98 | 99 | 90 | 98 |
Composite score (0 to 100) per category for the top 5 overall models. A wider polygon means the model is more consistent across categories.
behavioural: Sycophancy resistance, calibration under social pressure, pushback against confident-but-wrong claims.
coding: Code review, debugging, implementation. Tests pattern recognition, language-specific knowledge, and the ability to spot subtle bugs.
instruction following: Strict format and constraint adherence: exact list lengths, ordered steps, banned words, structural rules.
learning: Explanatory writing on technical topics. Tests how well the model teaches a concept to a target audience.
meta: Calibration and self-awareness: recognising false premises, hedging appropriately, knowing when to refuse.
reasoning: Multi-step quantitative reasoning, Fermi estimation, logical deduction, statistical analysis.
research: Open-ended research and synthesis: comparisons, tradeoff analysis, design recommendations.
writing: Production writing (docs, summaries, explanations) with constraints on length, audience, and format.
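The "consistent across categories" reading of the chart caption can be checked numerically from the table. The sketch below is illustrative only and rests on two assumptions that the page does not confirm: that the top 5 overall models are the first five columns of the table, and that consistency can be approximated by the population standard deviation of a model's eight category scores (a smaller spread means a more even profile). The `SCORES` and `summarize` names are hypothetical.

```python
from statistics import mean, pstdev

# Scores copied from the table, in row order: behavioural, coding,
# instruction following, learning, meta, reasoning, research, writing.
# Assumption: the first five table columns are the "top 5 overall models".
SCORES = {
    "gpt-5.3":           [95, 98, 97, 99, 85, 95, 97, 97],
    "claude-opus-4.7":   [95, 96, 89, 100, 98, 96, 95, 95],
    "claude-sonnet-4.6": [91, 95, 91, 100, 98, 95, 97, 97],
    "claude-opus-4.6":   [93, 94, 86, 100, 93, 96, 98, 98],
    "claude-opus-4.5":   [91, 94, 85, 99, 97, 96, 98, 97],
}


def summarize(scores):
    """Return {model: (mean score, spread, weakest category score)}.

    Spread is the population standard deviation across categories.
    This is an assumed stand-in for "consistency", not the page's own metric.
    """
    return {
        model: (mean(vals), pstdev(vals), min(vals))
        for model, vals in scores.items()
    }


summary = summarize(SCORES)

# List models from most to least even profile (smallest spread first).
for model, (avg, spread, low) in sorted(summary.items(), key=lambda kv: kv[1][1]):
    print(f"{model:18s} mean={avg:6.2f} spread={spread:5.2f} min={low}")
```

A high mean with a low spread corresponds to a polygon that is both large and evenly shaped. A single weak category, such as a low meta or instruction following score, raises the spread without changing the mean much.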