BenchPress - Judge Analysis

4 judges · 48 models · 80 prompts · Updated Mar 07, 2026 06:56
Judge              Mean score  Prompts scored  Note
claude-sonnet-4.6  4.05/5      3760            Strictest
gemini-2.5-flash   4.28/5      3760
gpt-4.1            4.36/5      3760
qwen3-235b         4.36/5      3760            Most Lenient

Judge Score Distributions

By Model (Top 15)

By Category

By Difficulty

Judge vs DeepEval Divergence

Judge Agreement

                   claude-sonnet-4.6  gemini-2.5-flash  gpt-4.1          qwen3-235b
claude-sonnet-4.6  -                  93% (diff 0.44)   96% (diff 0.40)  97% (diff 0.38)
gemini-2.5-flash   93% (diff 0.44)    -                 94% (diff 0.36)  94% (diff 0.35)
gpt-4.1            96% (diff 0.40)    94% (diff 0.36)   -                99% (diff 0.22)
qwen3-235b         97% (diff 0.38)    94% (diff 0.35)   99% (diff 0.22)  -
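
The page does not define how the agreement percentage or the "diff" value is computed. The sketch below is one plausible reading, not the dashboard's confirmed method: it assumes "agreement" is the share of items where two judges score within 1 point of each other, and "diff" is the mean absolute score difference. The function names and the `tolerance` parameter are hypothetical. The sample scores are the claude-sonnet-4.6 and gpt-4.1 values from five rows of the Biggest Disagreements list (I01 and R03 for claude-haiku-3, R10 for claude-opus-4.5, I01 for claude-opus-4, C06 for claude-sonnet-3.7), so they are not representative of the full data set.

```python
from typing import Sequence


def mean_abs_diff(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean absolute difference between two judges' paired scores."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def agreement_rate(a: Sequence[int], b: Sequence[int], tolerance: int = 1) -> float:
    """Fraction of items where the two scores differ by at most `tolerance`.

    The tolerance of 1 point is an assumption, not taken from the page.
    """
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)


# Paired scores taken from five rows of the Biggest Disagreements list.
claude_scores = [3, 5, 2, 1, 4]
gpt41_scores = [4, 5, 3, 2, 5]

print(f"agreement: {agreement_rate(claude_scores, gpt41_scores):.0%}")
print(f"diff: {mean_abs_diff(claude_scores, gpt41_scores):.2f}")
```

Under a stricter definition (for example exact-match agreement, `tolerance=0`) the same pairs would agree far less often, so the percentages in the matrix should be read with the undocumented definition in mind.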

Biggest Disagreements

Prompt  Model              Category               claude-sonnet-4.6  gemini-2.5-flash  gpt-4.1  qwen3-235b  Spread
I01     claude-opus-4.5    instruction_following  1/5                4/5               5/5      2/5         4
I01     claude-opus-4.6    instruction_following  1/5                5/5               2/5      2/5         4
C01     glm-4.7-flash      coding                 1/5                5/5               2/5      2/5         4
I06     glm-4.7-flash      instruction_following  1/5                5/5               2/5      1/5         4
I01     gpt-4.1-mini       instruction_following  1/5                5/5               3/5      2/5         4
B08     mistral-large-3    behavioural            1/5                5/5               2/5      2/5         4
B08     nova-micro         behavioural            1/5                5/5               3/5      4/5         4
I01     claude-haiku-3     instruction_following  3/5                2/5               4/5      5/5         3
R03     claude-haiku-3     reasoning              5/5                2/5               5/5      5/5         3
R10     claude-opus-4.5    reasoning              2/5                5/5               3/5      2/5         3
I01     claude-opus-4      instruction_following  1/5                4/5               2/5      2/5         3
C06     claude-sonnet-3.7  coding                 4/5                2/5               5/5      5/5         3
I01     claude-sonnet-3.7  instruction_following  1/5                4/5               4/5      2/5         3
I06     claude-sonnet-3.7  instruction_following  5/5                2/5               5/5      5/5         3
I01     claude-sonnet-4.5  instruction_following  1/5                3/5               4/5      2/5         3
I05     claude-sonnet-4.5  instruction_following  4/5                2/5               4/5      5/5         3
I06     claude-sonnet-4.5  instruction_following  5/5                2/5               5/5      5/5         3
R05     claude-sonnet-4.5  reasoning              4/5                2/5               5/5      5/5         3
I05     claude-sonnet-4.6  instruction_following  -                  2/5               3/5      5/5         3
I01     claude-sonnet-4    instruction_following  2/5                5/5               4/5      2/5         3
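
Every Spread value listed above matches the highest judge score minus the lowest on that row, with a missing score (shown as "-") left out. The page does not state this definition, so the sketch below is an assumption that happens to fit the listed rows; the `spread` helper is hypothetical.

```python
from typing import Optional, Sequence


def spread(scores: Sequence[Optional[int]]) -> int:
    """Highest minus lowest judge score, ignoring missing (None) entries."""
    present = [s for s in scores if s is not None]
    if not present:
        raise ValueError("at least one score is required")
    return max(present) - min(present)


# Scores in judge order: claude-sonnet-4.6, gemini-2.5-flash, gpt-4.1, qwen3-235b.
rows = {
    ("I01", "claude-opus-4.5"): [1, 4, 5, 2],
    ("I05", "claude-sonnet-4.6"): [None, 2, 3, 5],  # "-" in the table
}

for (prompt, model), scores in rows.items():
    print(prompt, model, spread(scores))
```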