A standalone benchmark for causal-inference reasoning: 100 multiple-choice questions arranged as 20 concept bundles × 5 variants. Each bundle covers one causal-inference pitfall (e.g. confounding, M-bias, Simpson's paradox); the 5 variants (base, trap, transfer, numeric, analyst) stress-test the same concept in different ways.
Scoring is purely deterministic (multiple choice; no LLM judges needed). Accuracy = correct ÷ valid responses. Errors counts API failures; Invalid counts empty or unextractable responses. Both are reported separately.
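As a minimal sketch of the deterministic scoring described above (the `Result` and `score_run` names are illustrative, not the benchmark's actual code): accuracy is computed over valid responses only, with API failures and unextractable answers tallied separately.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Result:
    answer: Optional[str]  # extracted choice letter, or None if unextractable
    api_error: bool        # True if the API request itself failed

def score_run(results: List[Result], answer_key: List[str]) -> dict:
    """Accuracy = correct / valid; Errors and Invalid reported separately."""
    errors = sum(1 for r in results if r.api_error)
    invalid = sum(1 for r in results if not r.api_error and r.answer is None)
    # Valid responses: the request succeeded and an answer was extracted.
    valid = [(r, key) for r, key in zip(results, answer_key)
             if not r.api_error and r.answer is not None]
    correct = sum(1 for r, key in valid if r.answer == key)
    accuracy = correct / len(valid) if valid else 0.0
    return {"correct": correct, "errors": errors,
            "invalid": invalid, "accuracy": accuracy}
```

Because extraction and comparison are purely mechanical, reruns over the same responses always reproduce the same score.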
Each variant tests the same causal-inference concept from a different angle. The table below shows each model's best score per variant.
For the top 10 models overall, accuracy on each of the five variants. Bars that drop sharply on Trap or Transfer reveal where a model relies on pattern-matching rather than structural reasoning.
Composite shading per variant for the top 15 models overall: green ≥ 80, yellow ≥ 60, red below 60.
| Model | Base | Trap | Transfer | Numeric | Analyst |
|---|---|---|---|---|---|
| claude-opus-4.6 | 85 | 85 | 55 | 75 | 80 |
| claude-sonnet-4.6 | 85 | 85 | 55 | 80 | 75 |
| grok-4 | 85 | 80 | 40 | 90 | 85 |
| qwen3-235b | 85 | 85 | 45 | 80 | 85 |
| claude-opus-4.7 | 85 | 85 | 55 | 75 | 75 |
| claude-opus-4 | 85 | 75 | 60 | 75 | 80 |
| gemini-3-pro | 85 | 75 | 45 | 85 | 85 |
| gpt-5.5 | 85 | 70 | 50 | 85 | 85 |
| o3-mini | 85 | 75 | 50 | 90 | 75 |
| o4-mini | 85 | 70 | 50 | 85 | 85 |
| claude-sonnet-4 | 85 | 80 | 50 | 75 | 80 |
| gemini-3.1-pro | 85 | 75 | 40 | 85 | 85 |
| gpt-5.1 | 85 | 85 | 60 | 60 | 80 |
| gpt-5.3 | 85 | 75 | 45 | 85 | 80 |
| claude-opus-4.5 | 85 | 85 | 45 | 70 | 80 |
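The shading rule for the heatmap above can be sketched as a simple threshold function (a hypothetical helper, not part of the benchmark's code):

```python
def band(score: int) -> str:
    """Map a variant score to its heatmap color: green >= 80, yellow >= 60, red below."""
    if score >= 80:
        return "green"
    if score >= 60:
        return "yellow"
    return "red"
```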
All models with causal-benchmark results. Score is correct answers out of 100; the Errors and Invalid columns separate API failures from empty or unextractable responses.
| # | Model | Accuracy | Score | Errors | Invalid |
|---|---|---|---|---|---|
| 1 | claude-opus-4.6 (Anthropic) | 76.0% | 76/100 | 0 | 0 |
| 2 | claude-sonnet-4.6 (Anthropic) | 76.0% | 76/100 | 0 | 0 |
| 3 | grok-4 (xAI) | 76.0% | 76/100 | 0 | 0 |
| 4 | qwen3-235b (Alibaba) | 76.0% | 76/100 | 0 | 0 |
| 5 | claude-opus-4.7 (Anthropic) | 75.0% | 75/100 | 0 | 0 |
| 6 | claude-opus-4 (Anthropic) | 75.0% | 75/100 | 0 | 0 |
| 7 | gemini-3-pro | 75.0% | 75/100 | 0 | 0 |
| 8 | gpt-5.5 (OpenAI) | 75.0% | 75/100 | 0 | 0 |
| 9 | o3-mini (OpenAI) | 75.0% | 75/100 | 0 | 0 |
| 10 | o4-mini (OpenAI) | 75.0% | 75/100 | 0 | 0 |
| 11 | claude-sonnet-4 (Anthropic) | 74.0% | 74/100 | 0 | 0 |
| 12 | gemini-3.1-pro | 74.0% | 74/100 | 0 | 0 |
| 13 | gpt-5.1 (OpenAI) | 74.0% | 74/100 | 0 | 0 |
| 14 | gpt-5.3 (OpenAI) | 74.0% | 74/100 | 0 | 0 |
| 15 | claude-opus-4.5 (Anthropic) | 73.0% | 73/100 | 0 | 0 |
| 16 | command-a (Cohere) | 73.0% | 73/100 | 0 | 0 |
| 17 | gpt-5.2 (OpenAI) | 73.0% | 73/100 | 0 | 0 |
| 18 | gpt-oss-120b (OpenAI) | 72.0% | 72/100 | 0 | 0 |
| 19 | minimax-m2.5 (MiniMax) | 72.0% | 72/100 | 0 | 0 |
| 20 | gpt-4o (OpenAI) | 71.0% | 71/100 | 0 | 0 |
| 21 | llama-4-scout (Meta) | 71.0% | 71/100 | 0 | 0 |
| 22 | nova-pro (Amazon) | 71.0% | 71/100 | 0 | 0 |
| 23 | claude-sonnet-4.5 (Anthropic) | 70.0% | 70/100 | 0 | 0 |
| 24 | gpt-4.1-mini (OpenAI) | 70.0% | 70/100 | 0 | 0 |
| 25 | grok-4.1-fast (xAI) | 70.0% | 70/100 | 0 | 0 |
| 26 | gpt-4.1 (OpenAI) | 69.0% | 69/100 | 0 | 0 |
| 27 | gpt-5.4 (OpenAI) | 69.0% | 69/100 | 0 | 0 |
| 28 | gpt-oss-20b (OpenAI) | 69.0% | 69/100 | 0 | 0 |
| 29 | nova-2-lite (Amazon) | 69.0% | 69/100 | 0 | 0 |
| 30 | glm-4.7-flash (Zhipu) | 66.0% | 66/100 | 0 | 0 |
| 31 | mistral-large-3 (Mistral) | 66.0% | 66/100 | 0 | 0 |
| 32 | qwen3-32b (Alibaba) | 65.0% | 65/100 | 0 | 0 |
| 33 | qwen3-coder-30b (Alibaba) | 65.0% | 65/100 | 0 | 0 |
| 34 | gpt-4o-mini (OpenAI) | 64.0% | 64/100 | 0 | 0 |
| 35 | gpt-4.1-nano (OpenAI) | 63.0% | 63/100 | 0 | 0 |
| 36 | nova-lite (Amazon) | 63.0% | 63/100 | 0 | 0 |
| 37 | gemini-3-flash | 62.0% | 62/100 | 0 | 0 |
| 38 | nova-micro (Amazon) | 62.0% | 62/100 | 0 | 0 |
| 39 | gemini-2.5-flash | 56.0% | 56/100 | 0 | 0 |
| 40 | llama3.1 (Meta) | 54.0% | 54/100 | 0 | 0 |
| 41 | llama3.2-vision-11b (Meta) | 51.0% | 51/100 | 0 | 0 |
| 42 | codestral (Mistral) | 45.0% | 45/100 | 0 | 0 |
| 43 | llama3.2 (Meta) | 45.0% | 45/100 | 0 | 0 |
| 44 | gemma-3-27b | 43.0% | 43/100 | 0 | 0 |