BenchPress - Judge Analysis

A note on judge bias

The four judges (claude-sonnet-4.6, gemini-2.5-flash, gpt-4.1, qwen3-235b) are themselves LLMs from companies whose models also appear on the leaderboard. This is a real conflict-of-interest concern, and two safeguards address it. (1) Self-judging is prevented: a judge never scores responses from its own model family (e.g. gpt-4.1 does not judge gpt-4.1, gpt-4o, etc.). (2) The Judge Agreement matrix and the Biggest Disagreements table below let you inspect where judges disagree and decide whether any bias is small enough for your use case.
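The self-judging rule can be sketched as a simple filter. This is a minimal illustration, not BenchPress's actual code: the `FAMILY_OF` table and the `eligible_judges` helper are hypothetical, and the source does not say how model families are defined beyond the gpt-4.1 / gpt-4o example.

```python
# Hypothetical family table. Only the gpt-4.1 / gpt-4o grouping comes from
# the text; the other entries are assumptions made for illustration.
FAMILY_OF = {
    "gpt-4.1": "gpt",
    "gpt-4o": "gpt",
    "claude-sonnet-4.6": "claude",
    "gemini-2.5-flash": "gemini",
    "qwen3-235b": "qwen",
    "mistral-large-3": "mistral",
}

JUDGES = ["claude-sonnet-4.6", "gemini-2.5-flash", "gpt-4.1", "qwen3-235b"]


def eligible_judges(model, judges=JUDGES, family_of=FAMILY_OF):
    """Return the judges allowed to score `model`.

    A judge is excluded when it belongs to the same family as the model.
    Models with no known family are scored by every judge.
    """
    model_family = family_of.get(model)
    if model_family is None:
        return list(judges)
    return [j for j in judges if family_of.get(j) != model_family]


print(eligible_judges("gpt-4o"))
print(eligible_judges("mistral-large-3"))
```

Keeping the family mapping as explicit data, rather than inferring it from name prefixes, makes the exclusion rule easy to audit.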

Judge Score Distributions

Judge Agreement

	claude-sonnet-4.6	gemini-2.5-flash	gpt-4.1	qwen3-235b
claude-sonnet-4.6	-	93% diff 0.43	96% diff 0.38	97% diff 0.36
gemini-2.5-flash	93% diff 0.43	-	94% diff 0.34	94% diff 0.34
gpt-4.1	96% diff 0.38	94% diff 0.34	-	99% diff 0.21
qwen3-235b	97% diff 0.36	94% diff 0.34	99% diff 0.21	-
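The page does not define the two numbers in each cell. The sketch below assumes, purely for illustration, that the percentage is the share of responses on which two judges' scores fall within 1 point of each other and that "diff" is the mean absolute score difference. The function name `pairwise_agreement` and the `tolerance` parameter are hypothetical. The sample scores are the gpt-4.1 and qwen3-235b columns of the first five rows of the Biggest Disagreements table, so they are deliberately contested cases and will not reproduce the matrix values.

```python
def pairwise_agreement(scores_a, scores_b, tolerance=1):
    """Return (agreement_rate, mean_abs_diff) for two judges' paired scores.

    ASSUMPTION: "agreement" means the scores differ by at most `tolerance`
    points. The source does not state the definition it actually uses.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    rate = sum(d <= tolerance for d in diffs) / len(diffs)
    mean_diff = sum(diffs) / len(diffs)
    return rate, mean_diff


# Scores (out of 5) from the first five rows of the disagreements table.
gpt_41 = [5, 2, 3, 3, 2]
qwen3_235b = [2, 2, 2, 2, 2]

rate, mean_diff = pairwise_agreement(gpt_41, qwen3_235b)
print(f"{rate:.0%} diff {mean_diff:.2f}")
```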

Biggest Disagreements

Prompt	Model	Category	claude-sonnet-4.6	gemini-2.5-flash	gpt-4.1	qwen3-235b	Spread
I01	claude-opus-4.5	instruction_following	1/5	4/5	5/5	2/5	4
I01	claude-opus-4.6	instruction_following	1/5	5/5	2/5	2/5	4
I01	claude-opus-4.7	instruction_following	1/5	5/5	3/5	2/5	4
I01	claude-opus-4.8	instruction_following	1/5	5/5	3/5	2/5	4
C01	glm-4.7-flash	coding	1/5	5/5	2/5	2/5	4
I06	glm-4.7-flash	instruction_following	1/5	5/5	2/5	1/5	4
I01	gpt-4.1-mini	instruction_following	1/5	5/5	3/5	2/5	4
B08	mistral-large-3	behavioural	1/5	5/5	2/5	2/5	4
B08	nova-micro	behavioural	1/5	5/5	3/5	4/5	4
R10	claude-fable-5	reasoning	2/5	5/5	2/5	2/5	3
I01	claude-haiku-3	instruction_following	3/5	2/5	4/5	5/5	3
R03	claude-haiku-3	reasoning	5/5	2/5	5/5	5/5	3
R10	claude-opus-4.5	reasoning	2/5	5/5	3/5	2/5	3
I08	claude-opus-4.8	instruction_following	2/5	5/5	2/5	2/5	3
R10	claude-opus-4.8	reasoning	2/5	5/5	4/5	3/5	3
I01	claude-opus-4	instruction_following	1/5	4/5	2/5	2/5	3
C06	claude-sonnet-3.7	coding	4/5	2/5	5/5	5/5	3
I01	claude-sonnet-3.7	instruction_following	1/5	4/5	4/5	2/5	3
I06	claude-sonnet-3.7	instruction_following	5/5	2/5	5/5	5/5	3
I01	claude-sonnet-4.5	instruction_following	1/5	3/5	4/5	2/5	3
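In every row shown above, the Spread column equals the highest judge score minus the lowest. The page does not state this formula, so treat it as an inference from the data. A minimal sketch of that calculation follows; the helper names `parse_score` and `spread` are hypothetical, and the sample rows are copied from the table.

```python
def parse_score(cell):
    """Convert a table cell such as '4/5' to the integer score 4."""
    return int(cell.split("/")[0])


def spread(cells):
    """Spread of one row: highest judge score minus lowest judge score."""
    scores = [parse_score(c) for c in cells]
    return max(scores) - min(scores)


# (prompt, model, judge scores in column order) taken from the table above.
rows = [
    ("R10", "claude-fable-5", ["2/5", "5/5", "2/5", "2/5"]),
    ("I01", "claude-opus-4.5", ["1/5", "4/5", "5/5", "2/5"]),
    ("C06", "claude-sonnet-3.7", ["4/5", "2/5", "5/5", "5/5"]),
]

# Sort so that the most contested responses come first.
ranked = sorted(rows, key=lambda row: spread(row[2]), reverse=True)
for prompt, model, cells in ranked:
    print(prompt, model, spread(cells))
```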

Additional chart panels (not reproduced here): By Model (Top 15), By Category, By Difficulty, Judge vs DeepEval Divergence.