51 models · 80 prompts · 8 categories · Judges: gpt-4.1, claude-sonnet-4.6, gemini-2.5-flash, qwen3-235b · Updated Jun 10, 2026 12:27

Company Progress Over Time

Best Model per Company

Category Strengths by Company

Models by Company

Alibaba (3 models) - best: 93
Model                Composite  Judge   DeepEval  Avg Latency  Efficiency
qwen3-235b           93         4.65/5  96        8.5s         0.48
qwen3-32b            86         4.22/5  92        4.3s         0.44
qwen3-coder-30b      84         4.12/5  90        1.9s         0.45
Amazon (4 models) - best: 84
Model                Composite  Judge   DeepEval  Avg Latency  Efficiency
nova-2-lite          84         4.15/5  89        17.6s        0.41
nova-pro             78         3.79/5  85        9.7s         0.42
nova-lite            70         3.38/5  81        2.3s         0.38
nova-micro           68         3.35/5  78        14.1s        0.37
Anthropic (11 models) - best: 96
Model                Composite  Judge   DeepEval  Avg Latency  Efficiency
claude-opus-4.8      96         4.81/5  98        12.7s        0.49
claude-fable-5       96         4.80/5  97        18.3s        0.47
claude-opus-4.7      96         4.75/5  97        13.1s        0.49
claude-sonnet-4.6    95         4.78/5  96        19.2s        0.48
claude-opus-4.6      95         4.73/5  97        22.0s        0.47
claude-opus-4.5      94         4.73/5  96        20.7s        0.48
claude-sonnet-4.5    94         4.71/5  95        13.9s        0.50
claude-opus-4        93         4.62/5  95        16.5s        0.49
claude-sonnet-4      93         4.60/5  95        13.3s        0.48
claude-sonnet-3.7    87         4.31/5  92        7.3s         0.49
claude-haiku-3       72         3.51/5  81        3.9s         0.40
Cohere (1 model) - best: 84
Model                Composite  Judge   DeepEval  Avg Latency  Efficiency
command-a            84         4.14/5  89        21.5s        0.45
Google (5 models) - best: 94
Model                Composite  Judge   DeepEval  Avg Latency  Efficiency
gemini-3-flash       94         4.70/5  95        10.2s        0.49
gemini-3-pro         93         4.66/5  95        21.5s        0.48
gemini-3.1-pro       93         4.65/5  94        39.5s        0.48
gemini-2.5-flash     88         4.38/5  91        14.0s        0.44
gemma-3-27b          86         4.23/5  92        21.1s        0.42
Meta (5 models) - best: 82
Model                Composite  Judge   DeepEval  Avg Latency  Efficiency
llama-4-maverick     82         4.06/5  88        1.4s         0.45
llama-4-scout        79         3.83/5  87        1.8s         0.43
llama3.1             67         3.30/5  77        15.8s        0.38
llama3.2-vision-11b  62         3.08/5  72        30.4s        0.35
llama3.2             59         2.95/5  70        13.5s        0.34
MiniMax (1 model) - best: 93
Model                Composite  Judge   DeepEval  Avg Latency  Efficiency
minimax-m2.5         93         4.63/5  95        40.0s        0.44
Mistral (2 models) - best: 88
Model                Composite  Judge   DeepEval  Avg Latency  Efficiency
mistral-large-3      88         4.34/5  93        24.4s        0.43
codestral            65         3.27/5  73        58.2s        0.38
Moonshot (1 model) - best: 94
Model                Composite  Judge   DeepEval  Avg Latency  Efficiency
kimi-k2.5            94         4.67/5  95        41.6s        0.43
OpenAI (14 models) - best: 96
Model                Composite  Judge   DeepEval  Avg Latency  Efficiency
gpt-5.3              96         4.83/5  97        5.7s         0.55
gpt-5.5              95         4.77/5  96        19.1s        0.48
gpt-5.2              95         4.77/5  96        12.9s        0.51
gpt-5.4              95         4.75/5  96        13.9s        0.49
gpt-5.1              95         4.72/5  96        15.8s        0.49
gpt-4.1              92         4.59/5  95        9.1s         0.51
o4-mini              92         4.60/5  95        12.7s        0.45
gpt-oss-120b         91         4.57/5  93        6.6s         0.41
o3-mini              91         4.51/5  93        10.7s        0.45
gpt-4.1-mini         91         4.51/5  93        10.5s        0.50
gpt-oss-20b          88         4.42/5  91        145.7s       0.39
gpt-4.1-nano         87         4.26/5  92        5.1s         0.49
gpt-4o               83         4.11/5  89        7.7s         0.48
gpt-4o-mini          82         4.01/5  88        7.8s         0.45
Zhipu (2 models) - best: 94
Model                Composite  Judge   DeepEval  Avg Latency  Efficiency
glm-5                94         4.70/5  95        67.9s        0.42
glm-4.7-flash        86         4.25/5  90        38.9s        0.38
xAI (2 models) - best: 93
Model                Composite  Judge   DeepEval  Avg Latency  Efficiency
grok-4               93         4.60/5  95        37.4s        0.47
grok-4.1-fast        91         4.48/5  94        9.5s         0.47