BenchPress - Causal Reasoning

100 Questions · 44 Models Tested · 76.0% Best Accuracy

About the causal benchmark

A standalone benchmark for causal-inference reasoning: 100 multiple-choice questions arranged as 20 concept bundles × 5 variants. Each bundle covers one causal-inference pitfall (e.g. confounding, M-bias, Simpson's paradox), and the five variants stress-test the same concept in different ways: Base, Trap, Transfer, Numeric, and Analyst (each described under "Top performer per variant" below).
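As a rough illustration of that layout, the item structure can be pictured as below. This is a minimal sketch; the field names are hypothetical, not the benchmark's actual schema.

```python
from dataclasses import dataclass

VARIANTS = ["base", "trap", "transfer", "numeric", "analyst"]

@dataclass
class Question:
    bundle: str          # the causal pitfall, e.g. "confounding" or "m-bias"
    variant: str         # one of VARIANTS
    prompt: str          # the multiple-choice question text
    options: list[str]   # the answer options
    gold: str            # correct option letter, e.g. "C"

# 20 bundles x 5 variants = 100 questions
assert 20 * len(VARIANTS) == 100
```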

Scoring is purely deterministic (multiple choice, no LLM judge needed). Accuracy = correct ÷ valid responses. Errors counts API failures; Invalid counts empty or unextractable responses. Both are reported separately from accuracy.
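A minimal scoring sketch under those definitions; the response fields (`api_error`, `extracted`, `gold`) are illustrative names, not the harness's real schema:

```python
def score_run(responses: list[dict]) -> dict:
    """Aggregate one model's run. Each response dict is assumed to carry
    'api_error' (bool), 'extracted' (option letter or None) and 'gold'."""
    errors  = [r for r in responses if r["api_error"]]
    invalid = [r for r in responses if not r["api_error"] and r["extracted"] is None]
    valid   = [r for r in responses if not r["api_error"] and r["extracted"] is not None]
    correct = sum(r["extracted"] == r["gold"] for r in valid)
    return {
        "accuracy": correct / len(valid) if valid else 0.0,  # correct / valid responses
        "score": correct,        # "Score" column: correct out of 100
        "errors": len(errors),   # API failures
        "invalid": len(invalid), # empty / unextractable responses
    }
```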

Top performer per variant

Each variant tests the same causal-inference concept from a different angle. The best score on each variant is shown alongside what it tests; a short sketch for recomputing these winners from the score data follows the table.

| Variant | Top model | Best score | What it tests |
|---|---|---|---|
| Base | claude-opus-4.6 | 85% | Narrative scenario combining 2-3 interacting causal issues. Tests recognising compound problems. |
| Trap | claude-opus-4.6 | 85% | Looks like the base concept applies, but the obvious answer is wrong. Tests resistance to pattern matching. |
| Transfer | command-a | 65% | The same reasoning translated into a formal DAG with short elimination-style options. Tests structural reasoning. |
| Numeric | grok-4 | 90% | Multi-step calculation with tables and conditional probabilities. Cannot be answered by intuition alone. |
| Analyst | grok-4 | 85% | Two analysts debate the scenario; identify which assessment is most accurate. Tests evaluating arguments. |
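These winners fall out of the per-variant scores mechanically. A minimal sketch, using a few rows transcribed from this report's tables (ties resolve to whichever model appears first in the dict):

```python
VARIANTS = ["base", "trap", "transfer", "numeric", "analyst"]

# A few rows transcribed from the heatmap and leaderboard in this report.
scores = {
    "claude-opus-4.6": {"base": 85, "trap": 85, "transfer": 55, "numeric": 75, "analyst": 80},
    "grok-4":          {"base": 85, "trap": 80, "transfer": 40, "numeric": 90, "analyst": 85},
    "command-a":       {"transfer": 65},  # outside the top 15; only its winning variant shown
}

for variant in VARIANTS:
    candidates = [m for m in scores if variant in scores[m]]
    best = max(candidates, key=lambda m: scores[m][variant])
    print(f"{variant}: {best} ({scores[best][variant]}%)")
```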

Per-variant accuracy (Top 10)

For the top 10 overall models, accuracy on each of the five variants. Bars that drop sharply on Trap or Transfer reveal where a model relies on pattern-matching over structural reasoning.
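One way to quantify that drop, sketched in Python; the function name and row format are illustrative, not part of the benchmark:

```python
def adversarial_drop(row: dict[str, int]) -> dict[str, int]:
    """Points lost from Base on the Trap and Transfer variants.
    A large drop suggests pattern-matching rather than structural reasoning."""
    return {v: row["base"] - row[v] for v in ("trap", "transfer")}

# grok-4's heatmap row below: base 85, trap 80, transfer 40
print(adversarial_drop({"base": 85, "trap": 80, "transfer": 40}))
# -> {'trap': 5, 'transfer': 45}
```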

Variant heatmap (Top 15)

Composite shading per variant for the top 15 overall models: green = score of 80 or above, yellow = 60-79, red = below 60. The shading rule is sketched in code after the table.

| Model | Base | Trap | Transfer | Numeric | Analyst |
|---|---|---|---|---|---|
| claude-opus-4.6 | 85 | 85 | 55 | 75 | 80 |
| claude-sonnet-4.6 | 85 | 85 | 55 | 80 | 75 |
| grok-4 | 85 | 80 | 40 | 90 | 85 |
| qwen3-235b | 85 | 85 | 45 | 80 | 85 |
| claude-opus-4.7 | 85 | 85 | 55 | 75 | 75 |
| claude-opus-4 | 85 | 75 | 60 | 75 | 80 |
| gemini-3-pro | 85 | 75 | 45 | 85 | 85 |
| gpt-5.5 | 85 | 70 | 50 | 85 | 85 |
| o3-mini | 85 | 75 | 50 | 90 | 75 |
| o4-mini | 85 | 70 | 50 | 85 | 85 |
| claude-sonnet-4 | 85 | 80 | 50 | 75 | 80 |
| gemini-3.1-pro | 85 | 75 | 40 | 85 | 85 |
| gpt-5.1 | 85 | 85 | 60 | 60 | 80 |
| gpt-5.3 | 85 | 75 | 45 | 85 | 80 |
| claude-opus-4.5 | 85 | 85 | 45 | 70 | 80 |
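The shading rule stated in the caption, as a small sketch:

```python
def cell_shade(score: int) -> str:
    """Heatmap shading rule: green >= 80, yellow 60-79, red below 60."""
    if score >= 80:
        return "green"
    if score >= 60:
        return "yellow"
    return "red"

assert cell_shade(85) == "green"
assert cell_shade(60) == "yellow"
assert cell_shade(55) == "red"
```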

Accuracy leaderboard

All models with causal-benchmark results. Score is correct answers out of 100. The Errors and Invalid columns separate API failures from empty or unextractable responses.
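As a hedged illustration of how a response could end up in the Invalid column, here is one way an answer extractor might look. The benchmark's actual extraction logic is not shown in this report, so both the function and the regex are assumptions:

```python
import re

def extract_choice(text: str) -> str | None:
    """Return the last standalone option letter A-E in a response,
    or None when nothing extractable is found (counted as Invalid).
    Illustrative only; not the benchmark's actual extractor."""
    letters = re.findall(r"\b([A-E])\b", text)
    return letters[-1] if letters else None

assert extract_choice("The answer is C.") == "C"
assert extract_choice("I cannot decide.") is None
```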

| # | Model | Provider | Accuracy | Score | Errors | Invalid |
|---|---|---|---|---|---|---|
| 1 | claude-opus-4.6 | Anthropic | 76.0% | 76/100 | 0 | 0 |
| 2 | claude-sonnet-4.6 | Anthropic | 76.0% | 76/100 | 0 | 0 |
| 3 | grok-4 | xAI | 76.0% | 76/100 | 0 | 0 |
| 4 | qwen3-235b | Alibaba | 76.0% | 76/100 | 0 | 0 |
| 5 | claude-opus-4.7 | Anthropic | 75.0% | 75/100 | 0 | 0 |
| 6 | claude-opus-4 | Anthropic | 75.0% | 75/100 | 0 | 0 |
| 7 | gemini-3-pro | Google | 75.0% | 75/100 | 0 | 0 |
| 8 | gpt-5.5 | OpenAI | 75.0% | 75/100 | 0 | 0 |
| 9 | o3-mini | OpenAI | 75.0% | 75/100 | 0 | 0 |
| 10 | o4-mini | OpenAI | 75.0% | 75/100 | 0 | 0 |
| 11 | claude-sonnet-4 | Anthropic | 74.0% | 74/100 | 0 | 0 |
| 12 | gemini-3.1-pro | Google | 74.0% | 74/100 | 0 | 0 |
| 13 | gpt-5.1 | OpenAI | 74.0% | 74/100 | 0 | 0 |
| 14 | gpt-5.3 | OpenAI | 74.0% | 74/100 | 0 | 0 |
| 15 | claude-opus-4.5 | Anthropic | 73.0% | 73/100 | 0 | 0 |
| 16 | command-a | Cohere | 73.0% | 73/100 | 0 | 0 |
| 17 | gpt-5.2 | OpenAI | 73.0% | 73/100 | 0 | 0 |
| 18 | gpt-oss-120b | OpenAI | 72.0% | 72/100 | 0 | 0 |
| 19 | minimax-m2.5 | MiniMax | 72.0% | 72/100 | 0 | 0 |
| 20 | gpt-4o | OpenAI | 71.0% | 71/100 | 0 | 0 |
| 21 | llama-4-scout | Meta | 71.0% | 71/100 | 0 | 0 |
| 22 | nova-pro | Amazon | 71.0% | 71/100 | 0 | 0 |
| 23 | claude-sonnet-4.5 | Anthropic | 70.0% | 70/100 | 0 | 0 |
| 24 | gpt-4.1-mini | OpenAI | 70.0% | 70/100 | 0 | 0 |
| 25 | grok-4.1-fast | xAI | 70.0% | 70/100 | 0 | 0 |
| 26 | gpt-4.1 | OpenAI | 69.0% | 69/100 | 0 | 0 |
| 27 | gpt-5.4 | OpenAI | 69.0% | 69/100 | 0 | 0 |
| 28 | gpt-oss-20b | OpenAI | 69.0% | 69/100 | 0 | 0 |
| 29 | nova-2-lite | Amazon | 69.0% | 69/100 | 0 | 0 |
| 30 | glm-4.7-flash | Zhipu | 66.0% | 66/100 | 0 | 0 |
| 31 | mistral-large-3 | Mistral | 66.0% | 66/100 | 0 | 0 |
| 32 | qwen3-32b | Alibaba | 65.0% | 65/100 | 0 | 0 |
| 33 | qwen3-coder-30b | Alibaba | 65.0% | 65/100 | 0 | 0 |
| 34 | gpt-4o-mini | OpenAI | 64.0% | 64/100 | 0 | 0 |
| 35 | gpt-4.1-nano | OpenAI | 63.0% | 63/100 | 0 | 0 |
| 36 | nova-lite | Amazon | 63.0% | 63/100 | 0 | 0 |
| 37 | gemini-3-flash | Google | 62.0% | 62/100 | 0 | 0 |
| 38 | nova-micro | Amazon | 62.0% | 62/100 | 0 | 0 |
| 39 | gemini-2.5-flash | Google | 56.0% | 56/100 | 0 | 0 |
| 40 | llama3.1 | Meta | 54.0% | 54/100 | 0 | 0 |
| 41 | llama3.2-vision-11b | Meta | 51.0% | 51/100 | 0 | 0 |
| 42 | codestral | Mistral | 45.0% | 45/100 | 0 | 0 |
| 43 | llama3.2 | Meta | 45.0% | 45/100 | 0 | 0 |
| 44 | gemma-3-27b | Google | 43.0% | 43/100 | 0 | 0 |