Two independent benchmarks, both scored 0–100 so results sit on a common scale. The General benchmark tests breadth: 80 prompts across 8 categories (coding, reasoning, writing, instruction-following, etc.), scored by 4 LLM judges plus DeepEval. The Causal benchmark tests one narrow skill, causal-inference reasoning: 100 multiple-choice questions across 20 concept bundles. A model can be strong on one and weak on the other; the scatter below shows the trade-off.
Each dot is a model, coloured by company. Top-right is best on both. Models without causal data are not shown.
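To make the common 0–100 scale concrete, here is a minimal sketch of how scores of this shape, plus a composite of the two, could be computed. The function names, equal judge weighting, and equal-weight composite are illustrative assumptions, not the benchmarks' actual pipeline.

```python
# A minimal sketch, not the real scoring pipeline. Function names,
# equal judge weighting, and the equal-weight composite are assumptions.
from statistics import mean

def breadth_score(judge_scores: dict[str, list[float]]) -> float:
    """Mean of per-prompt scores (each 0-100) across judges.

    judge_scores maps judge name -> one 0-100 score per prompt
    (80 prompts in the General benchmark).
    """
    return mean(mean(scores) for scores in judge_scores.values())

def causal_score(correct: int, total: int = 100) -> float:
    """Multiple-choice accuracy scaled to 0-100 (100 causal items)."""
    return 100 * correct / total

def composite(breadth: float, causal: float) -> float:
    """Equal-weight composite; the actual weighting may differ."""
    return (breadth + causal) / 2

# A model judged ~72 on breadth that answers 61/100 causal items
# lands at (72, 61) on the scatter above.
judges = {"judge_a": [70.0] * 80, "judge_b": [74.0] * 80}
b = breadth_score(judges)      # 72.0
c = causal_score(61)           # 61.0
print(b, c, composite(b, c))   # 72.0 61.0 66.5
```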
Each company's best published model over time. Composite score (0–100).
Top 10 on the breadth benchmark.
Top 10 on causal-inference reasoning.
Full 49-model leaderboard, DeepEval breakdown, difficulty curve, distributions, flags.
Per-variant accuracy, bundle consistency heatmap, excluded models.
Per-company tables, category strengths heatmap, frontier history.
Coding, writing, reasoning, instruction-following, behavioural, and more.