Two independent benchmarks, both scored 0–100 so results sit on a common scale. The General benchmark tests breadth: 80 prompts across 8 categories (coding, reasoning, writing, instruction-following, etc.), scored by 4 LLM judges plus DeepEval. The Causal benchmark tests one narrow skill, causal-inference reasoning: 100 multiple-choice questions across 20 concept bundles. A model can be strong on one and weak on the other; the scatter below shows the trade-off.
Each dot is a model, coloured by company. Top-right is best on both. Models without causal data are not shown.
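To make the common 0–100 scale concrete, here is a minimal sketch of how scores of this shape, plus a composite of the two, could be computed. The function names, equal judge weighting, and equal-weight composite are illustrative assumptions, not the benchmarks' actual pipeline.

```python
# A minimal sketch, not the real scoring pipeline. Function names,
# equal judge weighting, and the equal-weight composite are assumptions.
from statistics import mean

def breadth_score(judge_scores: dict[str, list[float]]) -> float:
    """Mean of per-prompt scores (each 0-100) across judges.

    judge_scores maps judge name -> one 0-100 score per prompt
    (80 prompts in the General benchmark).
    """
    return mean(mean(scores) for scores in judge_scores.values())

def causal_score(correct: int, total: int = 100) -> float:
    """Multiple-choice accuracy scaled to 0-100 (100 causal items)."""
    return 100 * correct / total

def composite(breadth: float, causal: float) -> float:
    """Equal-weight composite; the actual weighting may differ."""
    return (breadth + causal) / 2

# A model judged ~72 on breadth that answers 61/100 causal items
# lands at (72, 61) on the scatter above.
judges = {"judge_a": [70.0] * 80, "judge_b": [74.0] * 80}
b = breadth_score(judges)      # 72.0
c = causal_score(61)           # 61.0
print(b, c, composite(b, c))   # 72.0 61.0 66.5
```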
Each company's best published model over time. Composite score (0–100).
Top 10 on the breadth benchmark.
Top 10 on causal-inference reasoning.
Full 49-model leaderboard, DeepEval breakdown, difficulty curve, distributions, flags.
Per-variant accuracy, bundle consistency heatmap, excluded models.
Per-company tables, category strengths heatmap, frontier history.
Coding, writing, reasoning, instruction-following, behavioural, and more.