BenchPress - Causal Reasoning

100 Questions · 44 Models Tested · 76.0% Best Accuracy

About the causal benchmark

A standalone benchmark for causal-inference reasoning: 100 multiple-choice questions arranged as 20 concept bundles × 5 variants. Each bundle covers one causal-inference pitfall (e.g. confounding, M-bias, Simpson's paradox), and the five variants stress-test the same concept in different ways: Base, Trap, Transfer, Numeric, and Analyst (each described under "Top performer per variant" below).
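As a rough illustration of that layout, the item structure can be pictured as below. This is a minimal sketch; the field names are hypothetical, not the benchmark's actual schema.

```python
from dataclasses import dataclass

VARIANTS = ["base", "trap", "transfer", "numeric", "analyst"]

@dataclass
class Question:
    bundle: str          # the causal pitfall, e.g. "confounding" or "m-bias"
    variant: str         # one of VARIANTS
    prompt: str          # the multiple-choice question text
    options: list[str]   # the answer options
    gold: str            # correct option letter, e.g. "C"

# 20 bundles x 5 variants = 100 questions
assert 20 * len(VARIANTS) == 100
```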

Scoring is purely deterministic (multiple choice, no LLM judge needed). Accuracy = correct ÷ valid responses. Errors counts API failures; Invalid counts empty or unextractable responses. Both are reported separately from accuracy.
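A minimal scoring sketch under those definitions; the response fields (`api_error`, `extracted`, `gold`) are illustrative names, not the harness's real schema:

```python
def score_run(responses: list[dict]) -> dict:
    """Aggregate one model's run. Each response dict is assumed to carry
    'api_error' (bool), 'extracted' (option letter or None) and 'gold'."""
    errors  = [r for r in responses if r["api_error"]]
    invalid = [r for r in responses if not r["api_error"] and r["extracted"] is None]
    valid   = [r for r in responses if not r["api_error"] and r["extracted"] is not None]
    correct = sum(r["extracted"] == r["gold"] for r in valid)
    return {
        "accuracy": correct / len(valid) if valid else 0.0,  # correct / valid responses
        "score": correct,        # "Score" column: correct out of 100
        "errors": len(errors),   # API failures
        "invalid": len(invalid), # empty / unextractable responses
    }
```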

Top performer per variant

Each variant tests the same causal-inference concept from a different angle. The best score on each variant is shown alongside what it tests; a short sketch for recomputing these winners from the score data follows the table.

| Variant | Top model | Best score | What it tests |
|---|---|---|---|
| Base | claude-opus-4.6 | 85% | Narrative scenario combining 2-3 interacting causal issues. Tests recognising compound problems. |
| Trap | claude-opus-4.6 | 85% | Looks like the base concept applies, but the obvious answer is wrong. Tests resistance to pattern matching. |
| Transfer | command-a | 65% | The same reasoning translated into a formal DAG with short elimination-style options. Tests structural reasoning. |
| Numeric | grok-4 | 90% | Multi-step calculation with tables and conditional probabilities. Cannot be answered by intuition alone. |
| Analyst | grok-4 | 85% | Two analysts debate the scenario; identify which assessment is most accurate. Tests evaluating arguments. |
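These winners fall out of the per-variant scores mechanically. A minimal sketch, using a few rows transcribed from this report's tables (ties resolve to whichever model appears first in the dict):

```python
VARIANTS = ["base", "trap", "transfer", "numeric", "analyst"]

# A few rows transcribed from the heatmap and leaderboard in this report.
scores = {
    "claude-opus-4.6": {"base": 85, "trap": 85, "transfer": 55, "numeric": 75, "analyst": 80},
    "grok-4":          {"base": 85, "trap": 80, "transfer": 40, "numeric": 90, "analyst": 85},
    "command-a":       {"transfer": 65},  # outside the top 15; only its winning variant shown
}

for variant in VARIANTS:
    candidates = [m for m in scores if variant in scores[m]]
    best = max(candidates, key=lambda m: scores[m][variant])
    print(f"{variant}: {best} ({scores[best][variant]}%)")
```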

Per-variant accuracy (Top 10)

For the top 10 overall models, accuracy on each of the five variants. Bars that drop sharply on Trap or Transfer reveal where a model relies on pattern-matching over structural reasoning.
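One way to quantify that drop, sketched in Python; the function name and row format are illustrative, not part of the benchmark:

```python
def adversarial_drop(row: dict[str, int]) -> dict[str, int]:
    """Points lost from Base on the Trap and Transfer variants.
    A large drop suggests pattern-matching rather than structural reasoning."""
    return {v: row["base"] - row[v] for v in ("trap", "transfer")}

# grok-4's heatmap row below: base 85, trap 80, transfer 40
print(adversarial_drop({"base": 85, "trap": 80, "transfer": 40}))
# -> {'trap': 5, 'transfer': 45}
```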

Variant heatmap (Top 15)

Composite shading per variant for the top 15 overall models: green = score of 80 or above, yellow = 60-79, red = below 60. The shading rule is sketched in code after the table.

| Model | Base | Trap | Transfer | Numeric | Analyst |
|---|---|---|---|---|---|
| claude-opus-4.6 | 85 | 85 | 55 | 75 | 80 |
| claude-sonnet-4.6 | 85 | 85 | 55 | 80 | 75 |
| grok-4 | 85 | 80 | 40 | 90 | 85 |
| qwen3-235b | 85 | 85 | 45 | 80 | 85 |
| claude-opus-4.7 | 85 | 85 | 55 | 75 | 75 |
| claude-opus-4 | 85 | 75 | 60 | 75 | 80 |
| gemini-3-pro | 85 | 75 | 45 | 85 | 85 |
| gpt-5.5 | 85 | 70 | 50 | 85 | 85 |
| o3-mini | 85 | 75 | 50 | 90 | 75 |
| o4-mini | 85 | 70 | 50 | 85 | 85 |
| claude-sonnet-4 | 85 | 80 | 50 | 75 | 80 |
| gemini-3.1-pro | 85 | 75 | 40 | 85 | 85 |
| gpt-5.1 | 85 | 85 | 60 | 60 | 80 |
| gpt-5.3 | 85 | 75 | 45 | 85 | 80 |
| claude-opus-4.5 | 85 | 85 | 45 | 70 | 80 |
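The shading rule stated in the caption, as a small sketch:

```python
def cell_shade(score: int) -> str:
    """Heatmap shading rule: green >= 80, yellow 60-79, red below 60."""
    if score >= 80:
        return "green"
    if score >= 60:
        return "yellow"
    return "red"

assert cell_shade(85) == "green"
assert cell_shade(60) == "yellow"
assert cell_shade(55) == "red"
```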

Accuracy leaderboard

All models with causal-benchmark results. Score is correct answers out of 100. The Errors and Invalid columns separate API failures from empty or unextractable responses.
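As a hedged illustration of how a response could end up in the Invalid column, here is one way an answer extractor might look. The benchmark's actual extraction logic is not shown in this report, so both the function and the regex are assumptions:

```python
import re

def extract_choice(text: str) -> str | None:
    """Return the last standalone option letter A-E in a response,
    or None when nothing extractable is found (counted as Invalid).
    Illustrative only; not the benchmark's actual extractor."""
    letters = re.findall(r"\b([A-E])\b", text)
    return letters[-1] if letters else None

assert extract_choice("The answer is C.") == "C"
assert extract_choice("I cannot decide.") is None
```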

| # | Model | Provider | Accuracy | Score | Errors | Invalid |
|---|---|---|---|---|---|---|
| 1 | claude-opus-4.6 | Anthropic | 76.0% | 76/100 | 0 | 0 |
| 2 | claude-sonnet-4.6 | Anthropic | 76.0% | 76/100 | 0 | 0 |
| 3 | grok-4 | xAI | 76.0% | 76/100 | 0 | 0 |
| 4 | qwen3-235b | Alibaba | 76.0% | 76/100 | 0 | 0 |
| 5 | claude-opus-4.7 | Anthropic | 75.0% | 75/100 | 0 | 0 |
| 6 | claude-opus-4 | Anthropic | 75.0% | 75/100 | 0 | 0 |
| 7 | gemini-3-pro | Google | 75.0% | 75/100 | 0 | 0 |
| 8 | gpt-5.5 | OpenAI | 75.0% | 75/100 | 0 | 0 |
| 9 | o3-mini | OpenAI | 75.0% | 75/100 | 0 | 0 |
| 10 | o4-mini | OpenAI | 75.0% | 75/100 | 0 | 0 |
| 11 | claude-sonnet-4 | Anthropic | 74.0% | 74/100 | 0 | 0 |
| 12 | gemini-3.1-pro | Google | 74.0% | 74/100 | 0 | 0 |
| 13 | gpt-5.1 | OpenAI | 74.0% | 74/100 | 0 | 0 |
| 14 | gpt-5.3 | OpenAI | 74.0% | 74/100 | 0 | 0 |
| 15 | claude-opus-4.5 | Anthropic | 73.0% | 73/100 | 0 | 0 |
| 16 | command-a | Cohere | 73.0% | 73/100 | 0 | 0 |
| 17 | gpt-5.2 | OpenAI | 73.0% | 73/100 | 0 | 0 |
| 18 | gpt-oss-120b | OpenAI | 72.0% | 72/100 | 0 | 0 |
| 19 | minimax-m2.5 | MiniMax | 72.0% | 72/100 | 0 | 0 |
| 20 | gpt-4o | OpenAI | 71.0% | 71/100 | 0 | 0 |
| 21 | llama-4-scout | Meta | 71.0% | 71/100 | 0 | 0 |
| 22 | nova-pro | Amazon | 71.0% | 71/100 | 0 | 0 |
| 23 | claude-sonnet-4.5 | Anthropic | 70.0% | 70/100 | 0 | 0 |
| 24 | gpt-4.1-mini | OpenAI | 70.0% | 70/100 | 0 | 0 |
| 25 | grok-4.1-fast | xAI | 70.0% | 70/100 | 0 | 0 |
| 26 | gpt-4.1 | OpenAI | 69.0% | 69/100 | 0 | 0 |
| 27 | gpt-5.4 | OpenAI | 69.0% | 69/100 | 0 | 0 |
| 28 | gpt-oss-20b | OpenAI | 69.0% | 69/100 | 0 | 0 |
| 29 | nova-2-lite | Amazon | 69.0% | 69/100 | 0 | 0 |
| 30 | glm-4.7-flash | Zhipu | 66.0% | 66/100 | 0 | 0 |
| 31 | mistral-large-3 | Mistral | 66.0% | 66/100 | 0 | 0 |
| 32 | qwen3-32b | Alibaba | 65.0% | 65/100 | 0 | 0 |
| 33 | qwen3-coder-30b | Alibaba | 65.0% | 65/100 | 0 | 0 |
| 34 | gpt-4o-mini | OpenAI | 64.0% | 64/100 | 0 | 0 |
| 35 | gpt-4.1-nano | OpenAI | 63.0% | 63/100 | 0 | 0 |
| 36 | nova-lite | Amazon | 63.0% | 63/100 | 0 | 0 |
| 37 | gemini-3-flash | Google | 62.0% | 62/100 | 0 | 0 |
| 38 | nova-micro | Amazon | 62.0% | 62/100 | 0 | 0 |
| 39 | gemini-2.5-flash | Google | 56.0% | 56/100 | 0 | 0 |
| 40 | llama3.1 | Meta | 54.0% | 54/100 | 0 | 0 |
| 41 | llama3.2-vision-11b | Meta | 51.0% | 51/100 | 0 | 0 |
| 42 | codestral | Mistral | 45.0% | 45/100 | 0 | 0 |
| 43 | llama3.2 | Meta | 45.0% | 45/100 | 0 | 0 |
| 44 | gemma-3-27b | Google | 43.0% | 43/100 | 0 | 0 |