BenchPress - LLM Evaluation Leaderboard
Opinionated in scope. Objective in execution.
48 models · 80 prompts · 8 categories · Judges: gpt-4.1, claude-sonnet-4.6, gemini-2.5-flash, qwen3-235b · Updated Mar 07, 2026 06:56
Behavioural · gpt-5.3 · 0.95
Coding · gpt-5.4 · 0.99
Instruction Following · gpt-oss-120b · 0.98
Learning · claude-sonnet-4.6 · 0.99
Meta · claude-sonnet-4.6 · 0.98
Reasoning · grok-4 · 0.97
Research · claude-opus-4.6 · 0.98
Writing · gemini-3-pro · 0.99