Composite score (0 to 100) per category for the top 15 overall models. Full leaderboard is on the Generalist page.
| Category | claude-opus-4.8 | gpt-5.3 | claude-fable-5 | claude-opus-4.7 | claude-sonnet-4.6 | gpt-5.5 | gpt-5.2 | gpt-5.4 | claude-opus-4.6 | gpt-5.1 | claude-opus-4.5 | claude-sonnet-4.5 | gemini-3-flash | glm-5 | kimi-k2.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| behavioural | 93 | 95 | 90 | 95 | 91 | 97 | 90 | 89 | 93 | 91 | 91 | 91 | 87 | 88 | 90 |
| coding | 97 | 98 | 96 | 96 | 95 | 94 | 96 | 99 | 94 | 95 | 94 | 92 | 93 | 91 | 94 |
| instruction following | 90 | 97 | 93 | 89 | 91 | 93 | 97 | 96 | 86 | 90 | 85 | 90 | 92 | 97 | 81 |
| learning | 100 | 99 | 100 | 100 | 100 | 99 | 100 | 99 | 100 | 100 | 99 | 98 | 99 | 98 | 100 |
| meta | 100 | 85 | 100 | 98 | 98 | 82 | 86 | 86 | 93 | 94 | 97 | 93 | 89 | 79 | 87 |
| reasoning | 98 | 95 | 96 | 96 | 95 | 95 | 94 | 90 | 96 | 92 | 96 | 95 | 96 | 97 | 96 |
| research | 98 | 97 | 98 | 95 | 97 | 97 | 98 | 98 | 98 | 96 | 98 | 95 | 96 | 95 | 95 |
| writing | 98 | 97 | 96 | 95 | 97 | 98 | 96 | 97 | 98 | 97 | 97 | 98 | 98 | 98 | 98 |
Composite score (0 to 100) per category for the top 5 overall models. A wider polygon indicates higher scores; a more regular shape indicates more consistent performance across categories.
behavioural: Sycophancy resistance, calibration under social pressure, pushback against confident-but-wrong claims.
coding: Code review, debugging, implementation. Tests pattern recognition, language-specific knowledge, and ability to spot subtle bugs.
instruction following: Strict format and constraint adherence: exact list lengths, ordered steps, banned words, structural rules.
learning: Explanatory writing on technical topics. Tests how well the model teaches a concept to a target audience.
meta: Calibration and self-awareness: recognising false premises, hedging appropriately, knowing when to refuse.
reasoning: Multi-step quantitative reasoning, Fermi estimation, logical deduction, statistical analysis.
research: Open-ended research and synthesis: comparisons, tradeoff analysis, design recommendations.
writing: Production writing (docs, summaries, explanations) with constraints on length, audience, and format.
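The radar chart caption describes consistency across categories. As a rough illustration, that idea can be put into numbers from the table above. The sketch below assumes that the first five table columns are the top 5 models, that an unweighted mean is a reasonable stand-in for an overall score, and that the max-minus-min range is a reasonable measure of spread. None of these is confirmed by the page, and the leaderboard's own overall ranking may use a different weighting. The `summarize` helper is hypothetical.

```python
from statistics import mean

# Row order of the table above.
CATEGORIES = [
    "behavioural", "coding", "instruction following", "learning",
    "meta", "reasoning", "research", "writing",
]

# Scores copied from the first five columns of the table above
# (assumed, not confirmed, to be the top 5 overall models).
SCORES = {
    "claude-opus-4.8":   [93, 97, 90, 100, 100, 98, 98, 98],
    "gpt-5.3":           [95, 98, 97, 99, 85, 95, 97, 97],
    "claude-fable-5":    [90, 96, 93, 100, 100, 96, 98, 96],
    "claude-opus-4.7":   [95, 96, 89, 100, 98, 96, 95, 95],
    "claude-sonnet-4.6": [91, 95, 91, 100, 98, 95, 97, 97],
}


def summarize(scores):
    """Return {model: (unweighted mean, max - min spread)}.

    Both statistics are illustrative assumptions, not the
    leaderboard's actual scoring method.
    """
    return {
        model: (mean(values), max(values) - min(values))
        for model, values in scores.items()
    }


if __name__ == "__main__":
    # Smallest spread first: a smaller spread corresponds to a more
    # regular polygon on the radar chart.
    for model, (avg, spread) in sorted(
        summarize(SCORES).items(), key=lambda item: item[1][1]
    ):
        print(f"{model:18s} mean={avg:.2f} spread={spread}")
```

Under these assumptions, a model with a high mean and a small spread would show up as a large, nearly regular polygon, while a single weak category (for example a low meta score) would show up as a dent.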