This evaluation measures what matters for practical, day-to-day use of LLMs as a working tool. It is not a general knowledge benchmark or a trivia test. The prompt set is designed around tasks a developer, researcher, or technical writer would actually ask an LLM to do, with emphasis on scenarios where models commonly fail or diverge.
Each model runs through the same pipeline for every prompt:
- temperature: 0 for reproducibility

Deterministic, heuristic checks run instantly on every response. These flag mechanical failures and feed into the judge as additional signal.
| Check Type | Prompts |
|---|---|
| acknowledges nonexistence | 1 |
| ambiguity check | 2 |
| banned words | 3 |
| code runnable | 5 |
| constraint check | 2 |
| hallucination api | 1 |
| json valid | 1 |
| multi step verify | 3 |
| refusal check | 3 |
| response length | 3 |
| self awareness | 1 |
| statistical significance | 1 |
| sycophancy check | 5 |
| table format | 1 |
| trap common error | 1 |
| trap no bug | 1 |
| trap wrong claim | 1 |
| word count | 2 |
| word count reduction | 1 |
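As an illustration, two of the simpler heuristic checks (json valid and word count) could be sketched as below. These are hypothetical implementations written for this README, not the benchmark's actual code; function names and signatures are assumptions.

```python
import json


def check_json_valid(response: str) -> bool:
    """Heuristic check: does the response parse as JSON?"""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def check_word_count(response: str, max_words: int) -> bool:
    """Heuristic check: is the response within the word budget?"""
    return len(response.split()) <= max_words
```

Because these checks are deterministic and cheap, they can run on every response before any judge is invoked, and their pass/fail flags are passed along as extra signal.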
The following check types have no automated heuristics; the LLM judge scores them entirely on quality, reasoning, and adherence to criteria.
| Check Type | Prompts |
|---|---|
| analysis | 1 |
| behavioural | 3 |
| calibration | 2 |
| checklist | 2 |
| comparison | 3 |
| format check | 3 |
| reasoning | 26 |
| synthesis | 2 |
Each model response is scored by multiple independent LLM judges (configured in config.yaml),
each scoring on a 1-5 scale. The current judges are gpt-4.1 and claude-sonnet-4.6.
Each judge receives the original prompt, the ideal answer, the scoring criteria, and any
auto-check flags. It returns a score and a short rationale.
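The per-judge input and output described above could be modeled like this. The field names and dataclass shape are illustrative assumptions, not the repository's actual schema:

```python
from dataclasses import dataclass


@dataclass
class JudgeVerdict:
    judge: str       # e.g. "gpt-4.1"
    score: int       # 1-5 scale
    rationale: str   # short justification returned by the judge


def build_judge_payload(prompt: str, ideal: str, criteria: str,
                        auto_flags: list[str]) -> dict:
    # Hypothetical shape of what each judge receives: the original
    # prompt, the ideal answer, the scoring criteria, and any flags
    # raised by the automated checks.
    return {
        "prompt": prompt,
        "ideal_answer": ideal,
        "criteria": criteria,
        "auto_check_flags": auto_flags,
    }
```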
In addition to the multi-judge scores, each response is scored by DeepEval using G-Eval metrics - research-backed LLM evaluation criteria that provide multi-dimensional scoring on a 0-1 scale.
```bash
python run.py deepeval
```
The composite score merges the multi-judge average and DeepEval average into a single
0-1 metric for unified ranking. The judge score (mean of qualifying judges' averages)
is normalized from its 1-5 scale to 0-1 using (judge_score - 1) / 4,
then combined with the DeepEval average via a configurable weighted average.
Only judges with complete coverage (scored every prompt) contribute to the average.
Weights are configurable in config.yaml under the composite: section.
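Under the scheme described above, the composite could be computed as in this sketch. The 50/50 weight split is a placeholder assumption; the actual weights come from the `composite:` section of config.yaml:

```python
def composite_score(judge_score: float, deepeval_avg: float,
                    judge_weight: float = 0.5) -> float:
    """Merge a 1-5 judge score and a 0-1 DeepEval average into one 0-1 metric."""
    # Normalize the judge score from its 1-5 scale to 0-1.
    judge_norm = (judge_score - 1) / 4
    # Weighted average of the two signals.
    return judge_weight * judge_norm + (1 - judge_weight) * deepeval_avg
```

A perfect judge score (5) and a perfect DeepEval average (1.0) yield a composite of 1.0; a floor judge score (1) with a DeepEval average of 0.0 yields 0.0.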
The efficiency score balances quality against verbosity:
`efficiency = avg_score / log2(avg_tokens)`
This rewards models that achieve high scores without padding responses with unnecessary tokens.
A concise, correct answer scores higher than an equally correct but bloated one.
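The formula above translates directly into code. This is a minimal sketch of the stated formula, not the benchmark's implementation:

```python
import math


def efficiency(avg_score: float, avg_tokens: float) -> float:
    """Quality divided by the log2 of average response length in tokens."""
    return avg_score / math.log2(avg_tokens)


# Two models with the same average score of 4.2: the one averaging
# 256 tokens per response scores higher than the one averaging 1024.
concise = efficiency(4.2, 256)    # 4.2 / 8  ≈ 0.525
verbose = efficiency(4.2, 1024)   # 4.2 / 10 ≈ 0.42
```

The logarithm means verbosity is penalized sub-linearly: quadrupling token count costs only a fixed additive amount in the denominator, so moderately longer answers are not punished as harshly as raw tokens-per-point would.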
| Category | Prompts | Subcategories | What It Tests |
|---|---|---|---|
| Behavioural | 12 | appropriate refusal, hallucination, sycophancy, unsolicited opinions, verbosity | Sycophancy resistance, hallucination detection, appropriate refusal, verbosity control, and unsolicited opinion avoidance. Tests character and safety alignment. |
| Coding | 15 | algorithm reasoning, architecture, bug detection, code generation, code review, concurrency, cross language, debugging, debugging reasoning, ml implementation, performance, refactoring, security, testing, vague spec | Bug detection (including trap prompts with no bug), code generation, debugging, architecture design, security review, refactoring, concurrency, ML implementation, and cross-language tasks. Medium to hard difficulty. |
| Instruction Following | 8 | ambiguity handling, conflicting constraints, creative constraint, exact format, format compliance, multi constraint, multi step, refusal calibration | Exact format compliance, multi-constraint tasks, conflicting instructions, creative constraints, and ambiguity handling. Tests literal instruction adherence. |
| Learning | 12 | calibration, comparison, concept explanation, emerging, factual, factual accuracy, methodology, nuanced explanation, practical, practical advice, trap | Technical explanations, factual accuracy, nuanced comparisons, calibration, and trap questions testing common misconceptions. Tests depth of understanding vs surface-level answers. |
| Meta | 5 | calibration, honesty under pressure, self knowledge, trap, uncertainty | Self-knowledge, calibration, honesty under pressure, and uncertainty expression. Tests whether models know what they don't know. |
| Reasoning | 12 | causal reasoning, estimation, ethical tradeoff, evidence evaluation, expected value, false premise, logic, math with distractors, software tradeoffs, statistics, tradeoff analysis | Fermi estimation, logic puzzles, statistical analysis, ethical tradeoffs, causal reasoning, and false premise detection. Tests whether models show their work and catch tricks. |
| Research | 6 | comparison, contradictory sources, crash course, summarization fidelity, synthesis, technical evaluation | Source synthesis, contradictory evidence handling, technical evaluation, and summarization fidelity. Tests analytical depth over breadth. |
| Writing | 10 | anti slop, argumentation, constraint following, documentation, editing, email drafting, structured, technical writing, tone switching | Technical writing, tone switching, anti-slop detection, constrained writing, editing, email drafting, and argumentation. Tests natural voice and format compliance. |
| Difficulty | Prompts |
|---|---|
| easy | 10 |
| medium | 36 |
| hard | 34 |
Exact prompts are not published to prevent models from being tuned to this specific benchmark. Categories, evaluation criteria, and scoring methodology are fully documented above. Each prompt is scored by automated checks where applicable, plus multi-judge LLM scoring for nuanced evaluation.