49 models · 80 prompts · 8 categories · Judges: gpt-4.1, claude-sonnet-4.6, gemini-2.5-flash, qwen3-235b · Updated Apr 27, 2026 17:51
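The header lists four judge models. The page does not say how their ratings are combined into the Judge column, so as a generic illustration only: assuming a simple mean of per-judge ratings on a 1-5 scale (the ratings below are made up, not real data from this leaderboard):

```python
from statistics import mean

# Hypothetical per-judge ratings (1-5 scale) for one model's responses.
# Assumption: each judge scores independently and the ensemble score
# is their plain average; the leaderboard's actual weighting is unstated.
ratings = {
    "gpt-4.1": 4.5,
    "claude-sonnet-4.6": 4.8,
    "gemini-2.5-flash": 4.6,
    "qwen3-235b": 4.7,
}

judge_score = mean(ratings.values())  # ensemble judge score out of 5
print(f"{judge_score:.2f}/5")  # prints "4.65/5"
```

A multi-judge mean like this reduces single-judge bias, which is presumably why four judges from different vendors are used.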

Company Progress Over Time

Best Model per Company

Category Strengths by Company

Models by Company

Alibaba (3 models) - best: 93

| Model | Composite | Judge | DeepEval | Avg Latency | Efficiency |
|---|---|---|---|---|---|
| qwen3-235b | 93 | 4.60/5 | 95 | 8.4s | 0.48 |
| qwen3-32b | 86 | 4.22/5 | 92 | 4.3s | 0.44 |
| qwen3-coder-30b | 84 | 4.12/5 | 90 | 1.9s | 0.45 |
Amazon (4 models) - best: 84

| Model | Composite | Judge | DeepEval | Avg Latency | Efficiency |
|---|---|---|---|---|---|
| nova-2-lite | 84 | 4.15/5 | 89 | 17.6s | 0.41 |
| nova-pro | 78 | 3.79/5 | 85 | 9.7s | 0.42 |
| nova-lite | 70 | 3.35/5 | 80 | 2.3s | 0.37 |
| nova-micro | 68 | 3.35/5 | 78 | 14.1s | 0.37 |
Anthropic (9 models) - best: 96

| Model | Composite | Judge | DeepEval | Avg Latency | Efficiency |
|---|---|---|---|---|---|
| claude-opus-4.7 | 96 | 4.75/5 | 97 | 13.1s | 0.49 |
| claude-sonnet-4.6 | 95 | 4.78/5 | 96 | 19.2s | 0.48 |
| claude-opus-4.6 | 95 | 4.73/5 | 97 | 22.0s | 0.47 |
| claude-opus-4.5 | 94 | 4.73/5 | 96 | 20.7s | 0.48 |
| claude-sonnet-4.5 | 94 | 4.71/5 | 95 | 13.9s | 0.50 |
| claude-opus-4 | 93 | 4.62/5 | 95 | 16.5s | 0.49 |
| claude-sonnet-4 | 93 | 4.60/5 | 95 | 13.3s | 0.48 |
| claude-sonnet-3.7 | 87 | 4.31/5 | 92 | 7.3s | 0.49 |
| claude-haiku-3 | 71 | 3.48/5 | 80 | 3.9s | 0.40 |
Cohere (1 model) - best: 83

| Model | Composite | Judge | DeepEval | Avg Latency | Efficiency |
|---|---|---|---|---|---|
| command-a | 83 | 4.12/5 | 89 | 21.0s | 0.45 |
Google (5 models) - best: 94

| Model | Composite | Judge | DeepEval | Avg Latency | Efficiency |
|---|---|---|---|---|---|
| gemini-3-flash | 94 | 4.71/5 | 95 | 10.1s | 0.49 |
| gemini-3-pro | 93 | 4.67/5 | 95 | 21.1s | 0.48 |
| gemini-3.1-pro | 93 | 4.65/5 | 94 | 39.2s | 0.48 |
| gemini-2.5-flash | 88 | 4.39/5 | 91 | 13.7s | 0.44 |
| gemma-3-27b | 86 | 4.19/5 | 91 | 20.9s | 0.42 |
Meta (5 models) - best: 83

| Model | Composite | Judge | DeepEval | Avg Latency | Efficiency |
|---|---|---|---|---|---|
| llama-4-maverick | 83 | 4.07/5 | 88 | 1.4s | 0.45 |
| llama-4-scout | 79 | 3.84/5 | 87 | 1.8s | 0.43 |
| llama3.1 | 67 | 3.27/5 | 77 | 15.6s | 0.38 |
| llama3.2-vision-11b | 62 | 3.08/5 | 72 | 30.4s | 0.35 |
| llama3.2 | 59 | 2.95/5 | 70 | 13.5s | 0.34 |
MiniMax (1 model) - best: 93

| Model | Composite | Judge | DeepEval | Avg Latency | Efficiency |
|---|---|---|---|---|---|
| minimax-m2.5 | 93 | 4.63/5 | 94 | 39.7s | 0.44 |
Mistral (2 models) - best: 88

| Model | Composite | Judge | DeepEval | Avg Latency | Efficiency |
|---|---|---|---|---|---|
| mistral-large-3 | 88 | 4.34/5 | 93 | 24.4s | 0.43 |
| codestral | 65 | 3.27/5 | 73 | 58.2s | 0.38 |
Moonshot (1 model) - best: 94

| Model | Composite | Judge | DeepEval | Avg Latency | Efficiency |
|---|---|---|---|---|---|
| kimi-k2.5 | 94 | 4.67/5 | 96 | 41.1s | 0.43 |
OpenAI (14 models) - best: 96

| Model | Composite | Judge | DeepEval | Avg Latency | Efficiency |
|---|---|---|---|---|---|
| gpt-5.3 | 96 | 4.83/5 | 97 | 5.7s | 0.55 |
| gpt-5.2 | 94 | 4.72/5 | 96 | 12.8s | 0.50 |
| gpt-5.4 | 94 | 4.70/5 | 96 | 13.7s | 0.49 |
| gpt-5.1 | 94 | 4.68/5 | 96 | 15.4s | 0.49 |
| gpt-5.5 | 93 | 4.58/5 | 96 | 22.0s | 0.45 |
| gpt-4.1 | 92 | 4.55/5 | 95 | 9.0s | 0.51 |
| gpt-oss-120b | 92 | 4.58/5 | 93 | 6.5s | 0.41 |
| o4-mini | 91 | 4.52/5 | 95 | 13.0s | 0.44 |
| o3-mini | 91 | 4.51/5 | 93 | 10.6s | 0.45 |
| gpt-4.1-mini | 91 | 4.51/5 | 93 | 10.4s | 0.50 |
| gpt-oss-20b | 88 | 4.43/5 | 90 | 142.1s | 0.40 |
| gpt-4.1-nano | 87 | 4.26/5 | 92 | 5.1s | 0.49 |
| gpt-4o | 83 | 4.11/5 | 89 | 7.7s | 0.48 |
| gpt-4o-mini | 82 | 4.01/5 | 88 | 7.8s | 0.45 |
Zhipu (2 models) - best: 94

| Model | Composite | Judge | DeepEval | Avg Latency | Efficiency |
|---|---|---|---|---|---|
| glm-5 | 94 | 4.70/5 | 95 | 66.5s | 0.42 |
| glm-4.7-flash | 86 | 4.26/5 | 90 | 38.6s | 0.38 |
xAI (2 models) - best: 92

| Model | Composite | Judge | DeepEval | Avg Latency | Efficiency |
|---|---|---|---|---|---|
| grok-4 | 92 | 4.61/5 | 95 | 37.0s | 0.47 |
| grok-4.1-fast | 90 | 4.44/5 | 94 | 9.4s | 0.47 |
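The per-company "best" figures above follow from sorting rows by Composite, with Judge as a natural tie-breaker. A minimal sketch of that ranking over a handful of rows copied from the tables (only a sample, not the full dataset):

```python
# Sample rows from the leaderboard: (model, composite, judge, deepeval,
# avg_latency_s, efficiency). Ranking is by composite descending,
# ties broken by judge score descending.
rows = [
    ("qwen3-235b", 93, 4.60, 95, 8.4, 0.48),
    ("claude-opus-4.7", 96, 4.75, 97, 13.1, 0.49),
    ("kimi-k2.5", 94, 4.67, 96, 41.1, 0.43),
    ("gpt-5.3", 96, 4.83, 97, 5.7, 0.55),
]

ranked = sorted(rows, key=lambda r: (-r[1], -r[2]))
for model, composite, judge, *_ in ranked:
    print(f"{model}: composite {composite}, judge {judge:.2f}/5")
```

With this key, gpt-5.3 edges out claude-opus-4.7 on the judge tie-breaker at composite 96, matching their order in the OpenAI and Anthropic tables.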