4 judge(s) · 49 models · 80 prompts · Updated Apr 27, 2026 17:51
Judge                Score     Prompts scored   Note
claude-sonnet-4.6    4.07/5    3840             Strictest
gemini-2.5-flash     4.30/5    3839
gpt-4.1              4.38/5    3840
qwen3-235b           4.38/5    3840             Most Lenient

A note on judge bias

The four judges (claude-sonnet-4.6, gemini-2.5-flash, gpt-4.1, qwen3-235b) are themselves LLMs from companies whose models also appear on the leaderboard. This is a real conflict-of-interest concern. Two safeguards apply: (1) Self-judging is prevented: a judge never scores responses from its own model family (e.g. gpt-4.1 does not judge gpt-4.1, gpt-4o, etc.). (2) The Judge Agreement matrix and Biggest Disagreements table below let you inspect where the judges disagree and decide whether the bias is acceptably bounded for your use case.
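Safeguard (1) can be sketched as a simple eligibility filter. This is a minimal illustration, not the leaderboard's actual implementation: the prefix-to-family mapping and the function names `family_of`, `may_judge`, and `eligible_judges` are assumptions made for the example.

```python
from typing import List, Optional

# Hypothetical mapping from model-name prefix to vendor family.
# The real family grouping is not given on this page.
FAMILY_PREFIXES = {
    "claude": "anthropic",
    "gemini": "google",
    "gpt": "openai",
    "qwen": "alibaba",
}

JUDGES = ["claude-sonnet-4.6", "gemini-2.5-flash", "gpt-4.1", "qwen3-235b"]


def family_of(model: str) -> Optional[str]:
    """Return the assumed vendor family for a model name, or None if unknown."""
    for prefix, family in FAMILY_PREFIXES.items():
        if model.startswith(prefix):
            return family
    return None


def may_judge(judge: str, model: str) -> bool:
    """A judge may score a model only if the two are not in the same family."""
    judge_family = family_of(judge)
    return judge_family is None or judge_family != family_of(model)


def eligible_judges(model: str) -> List[str]:
    """List the judges allowed to score responses from the given model."""
    return [judge for judge in JUDGES if may_judge(judge, model)]


if __name__ == "__main__":
    print(eligible_judges("gpt-4o"))           # gpt-4.1 is excluded
    print(eligible_judges("mistral-large-3"))  # all four judges remain
```

Under this rule, a model from a judge's family is scored by three judges, and every other model by all four.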

Judge Score Distributions

By Model (Top 15)

By Category

By Difficulty

Judge vs DeepEval Divergence

Judge Agreement

                     claude-sonnet-4.6   gemini-2.5-flash   gpt-4.1           qwen3-235b
claude-sonnet-4.6    -                   93% (diff 0.44)    96% (diff 0.39)   97% (diff 0.37)
gemini-2.5-flash     93% (diff 0.44)     -                  94% (diff 0.35)   94% (diff 0.35)
gpt-4.1              96% (diff 0.39)     94% (diff 0.35)    -                 99% (diff 0.22)
qwen3-235b           97% (diff 0.37)     94% (diff 0.35)    99% (diff 0.22)   -
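The page does not define how the agreement percentage or the "diff" value is computed. The sketch below shows one plausible reading, stated as an assumption: agreement is the share of jointly scored items where two judges are within one point of each other, and diff is the mean absolute score difference over those items. The function name `pairwise_stats`, the `tolerance` parameter, and the sample scores are all hypothetical.

```python
from typing import Dict, Hashable, Tuple


def pairwise_stats(
    scores_a: Dict[Hashable, int],
    scores_b: Dict[Hashable, int],
    tolerance: int = 1,
) -> Tuple[float, float]:
    """Compare two judges over the items both of them scored.

    Returns (agreement, mean_abs_diff), where agreement is the fraction of
    shared items whose scores differ by at most `tolerance` (an assumed
    definition), and mean_abs_diff is the average absolute score gap.
    """
    shared = scores_a.keys() & scores_b.keys()
    if not shared:
        raise ValueError("the two judges have no scored items in common")
    gaps = [abs(scores_a[key] - scores_b[key]) for key in shared]
    agreement = sum(gap <= tolerance for gap in gaps) / len(gaps)
    mean_abs_diff = sum(gaps) / len(gaps)
    return agreement, mean_abs_diff


if __name__ == "__main__":
    # Made-up scores keyed by (prompt, model); not taken from the leaderboard.
    judge_a = {("P1", "m1"): 1, ("P2", "m1"): 5, ("P3", "m2"): 4}
    judge_b = {("P1", "m1"): 4, ("P2", "m1"): 5, ("P3", "m2"): 5, ("P4", "m3"): 2}
    agreement, diff = pairwise_stats(judge_a, judge_b)
    print(f"agreement={agreement:.0%} diff={diff:.2f}")
```

Restricting the comparison to shared items matters here because each judge skips some responses, so the two score sets do not cover exactly the same rows.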

Biggest Disagreements

Prompt  Model              Category               claude-sonnet-4.6  gemini-2.5-flash  gpt-4.1  qwen3-235b  Spread
I01     claude-opus-4.5    instruction_following  1/5                4/5               5/5      2/5         4
I01     claude-opus-4.6    instruction_following  1/5                5/5               2/5      2/5         4
I01     claude-opus-4.7    instruction_following  1/5                5/5               3/5      2/5         4
C01     glm-4.7-flash      coding                 1/5                5/5               2/5      2/5         4
I06     glm-4.7-flash      instruction_following  1/5                5/5               2/5      1/5         4
I01     gpt-4.1-mini       instruction_following  1/5                5/5               3/5      2/5         4
B08     mistral-large-3    behavioural            1/5                5/5               2/5      2/5         4
B08     nova-micro         behavioural            1/5                5/5               3/5      4/5         4
I01     claude-haiku-3     instruction_following  3/5                2/5               4/5      5/5         3
R03     claude-haiku-3     reasoning              5/5                2/5               5/5      5/5         3
R10     claude-opus-4.5    reasoning              2/5                5/5               3/5      2/5         3
I01     claude-opus-4      instruction_following  1/5                4/5               2/5      2/5         3
C06     claude-sonnet-3.7  coding                 4/5                2/5               5/5      5/5         3
I01     claude-sonnet-3.7  instruction_following  1/5                4/5               4/5      2/5         3
I06     claude-sonnet-3.7  instruction_following  5/5                2/5               5/5      5/5         3
I01     claude-sonnet-4.5  instruction_following  1/5                3/5               4/5      2/5         3
I05     claude-sonnet-4.5  instruction_following  4/5                2/5               4/5      5/5         3
I06     claude-sonnet-4.5  instruction_following  5/5                2/5               5/5      5/5         3
R05     claude-sonnet-4.5  reasoning              4/5                2/5               5/5      5/5         3
I05     claude-sonnet-4.6  instruction_following  -                  2/5               3/5      5/5         3
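The Spread column is consistent with the gap between the highest and lowest judge score in a row, ignoring judges that did not score it (shown as "-"). A minimal sketch of that reading follows; the function name `spread` and the use of `None` for a missing score are assumptions for the example.

```python
from typing import Optional, Sequence


def spread(scores: Sequence[Optional[int]]) -> int:
    """Return max minus min over the judge scores that are present.

    A missing score (a judge that did not rate the response) is passed
    as None and ignored.
    """
    present = [score for score in scores if score is not None]
    if not present:
        raise ValueError("at least one judge score is required")
    return max(present) - min(present)


if __name__ == "__main__":
    # Rows from the table above, in judge order:
    # claude-sonnet-4.6, gemini-2.5-flash, gpt-4.1, qwen3-235b
    print(spread([1, 4, 5, 2]))     # I01 / claude-opus-4.5
    print(spread([None, 2, 3, 5]))  # I05 / claude-sonnet-4.6
```

Both rows reproduce the tabulated Spread values (4 and 3 respectively).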