49 models · 80 prompts · 8 categories · Judges: gpt-4.1, claude-sonnet-4.6, gemini-2.5-flash, qwen3-235b · Updated Apr 27, 2026 17:51
Top Model: gpt-5.3 (96 general)
Models Evaluated: 49 (3,920 scored responses: 49 models × 80 prompts)
Most Efficient: gpt-5.3 (0.55)
Top Causal Model: claude-opus-4.6 (76 causal)

How to read this dashboard

Two independent benchmarks, both scored 0–100 so they're directly comparable. General tests breadth: 80 prompts across 8 categories (coding, reasoning, writing, instruction-following, etc.), scored by 4 LLM judges plus DeepEval. Causal tests depth in a single domain: 100 multiple-choice causal-inference questions across 20 concept bundles. A model can be strong on one and weak on the other; the scatter below shows the trade-off.
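As a rough illustration of how a judged benchmark like the general track aggregates, here is a minimal Python sketch, assuming each judge returns one 0–100 score per response. The function, data, and structure are hypothetical, not the dashboard's actual pipeline:

```python
# Hypothetical sketch of general-score aggregation, assuming each of the
# 4 LLM judges gives one 0-100 score per (model, prompt) response. Names
# and structure are illustrative; the real pipeline may differ.
from statistics import mean

def general_score(judge_scores: dict[str, list[float]]) -> float:
    """Average each judge's per-prompt scores, then average across judges."""
    return mean(mean(scores) for scores in judge_scores.values())

# Toy example: two prompts per judge instead of the full 80.
scores = {
    "gpt-4.1":           [92.0, 95.0],
    "claude-sonnet-4.6": [90.0, 97.0],
    "gemini-2.5-flash":  [88.0, 94.0],
    "qwen3-235b":        [91.0, 96.0],
}
print(general_score(scores))  # 92.875
```

With an equal prompt count per judge this equals a flat average over all 320 judge scores (4 judges × 80 prompts), but writing it per judge keeps the two aggregation levels explicit.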

General × Causal

Each dot is a model, coloured by company. Top-right is best on both. Models without causal data are not shown.
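For readers rebuilding this view offline, a minimal matplotlib sketch of the same scatter, assuming one general and one causal score per model. Model names, companies, and scores below are placeholders, not leaderboard numbers:

```python
# Minimal sketch of the General x Causal scatter. Models without causal
# data are simply left out of the input list. All values are placeholders.
import matplotlib.pyplot as plt

models = [
    # (name, general, causal, company)
    ("model-a", 96, 68, "company-1"),
    ("model-b", 88, 76, "company-2"),
    ("model-c", 81, 62, "company-1"),
]
colors = {"company-1": "tab:blue", "company-2": "tab:orange"}

seen = set()
for name, general, causal, company in models:
    # Label each company only once so the legend has no duplicates.
    label = company if company not in seen else None
    seen.add(company)
    plt.scatter(general, causal, color=colors[company], label=label)
    plt.annotate(name, (general, causal))
plt.xlabel("General (0-100)")
plt.ylabel("Causal (0-100)")
plt.legend()
plt.show()
```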

Company Progress Over Time

Each company's best published model over time, by composite score (0–100).
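"Best published model over time" is effectively a running maximum per company. A hedged pandas sketch with hypothetical column names and placeholder data:

```python
# Sketch of the per-company frontier: for each company, the best composite
# score reached by any of its models up to each release date. Column names
# and data are hypothetical; the dashboard's own computation may differ.
import pandas as pd

df = pd.DataFrame({
    "company":   ["co-1", "co-1", "co-2", "co-2"],
    "released":  pd.to_datetime(["2025-01-01", "2025-06-01",
                                 "2025-03-01", "2025-09-01"]),
    "composite": [70.0, 85.0, 68.0, 90.0],
})

df = df.sort_values("released")
df["frontier"] = df.groupby("company")["composite"].cummax()
print(df)
```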

Top 10 Generalist

Best 10 on the breadth benchmark.

Top 10 Causal

Best 10 on causal-inference reasoning.

Generalist leaderboard →

Full 49-model leaderboard, DeepEval breakdown, difficulty curve, distributions, flags.

Causal leaderboard →

Per-variant accuracy, bundle consistency heatmap, excluded models.
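One plausible reading of "bundle consistency": the fraction of concept bundles in which a model gives the same answer to every variant of the bundle. A hedged sketch with placeholder data; the leaderboard's actual definition may differ:

```python
# Plausible bundle-consistency metric, assuming each causal question belongs
# to a concept bundle and has several paraphrased variants. "Consistent" here
# means the model answers every variant of a bundle identically.
from collections import defaultdict

# (bundle_id, variant_id, model_answer) triples; placeholder data.
answers = [
    ("confounding", 1, "B"), ("confounding", 2, "B"), ("confounding", 3, "B"),
    ("colliders",   1, "A"), ("colliders",   2, "C"),
]

by_bundle = defaultdict(set)
for bundle, _variant, answer in answers:
    by_bundle[bundle].add(answer)

consistency = sum(len(s) == 1 for s in by_bundle.values()) / len(by_bundle)
print(f"bundle consistency: {consistency:.0%}")  # bundle consistency: 50%
```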

Browse by company →

Per-company tables, category strengths heatmap, frontier history.

Browse by category →

Coding, writing, reasoning, instruction-following, behavioural, and more.