This evaluation measures what matters for practical, day-to-day use of LLMs as a working tool. It is not a general knowledge benchmark or a trivia test. The prompt set is designed around tasks a developer, researcher, or technical writer would actually ask an LLM to do, with emphasis on scenarios where models commonly fail or diverge.
Each model runs through the same pipeline for every prompt, with temperature set to 0 for reproducibility.

Auto-checks are deterministic heuristics that run instantly on every response and flag mechanical failures.
| Check Type | Prompts |
|---|---|
| acknowledges nonexistence | 1 |
| ambiguity check | 2 |
| banned words | 3 |
| code runnable | 5 |
| constraint check | 2 |
| hallucination api | 1 |
| json valid | 1 |
| multi step verify | 3 |
| refusal check | 3 |
| response length | 3 |
| self awareness | 1 |
| statistical significance | 1 |
| sycophancy check | 5 |
| table format | 1 |
| trap common error | 1 |
| trap no bug | 1 |
| trap wrong claim | 1 |
| word count | 2 |
| word count reduction | 1 |
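For illustration, a minimal sketch of what one of these heuristic checks might look like; the function shape and return format are assumptions, not the dashboard's actual code:

```python
import json

def check_json_valid(response_text: str) -> dict:
    """Flag responses that were asked for JSON but don't parse as JSON."""
    try:
        json.loads(response_text)
        return {"check": "json_valid", "passed": True}
    except json.JSONDecodeError as exc:
        return {"check": "json_valid", "passed": False, "detail": str(exc)}
```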
The following check types have no automated heuristic; the LLM judges score them entirely on quality and reasoning.
| Check Type | Prompts |
|---|---|
| analysis | 1 |
| behavioural | 3 |
| calibration | 2 |
| checklist | 2 |
| comparison | 3 |
| format check | 3 |
| reasoning | 26 |
| synthesis | 2 |
Each model response is scored by multiple independent LLM judges, each rating it 1 to 5. The current judges are gpt-4.1, claude-sonnet-4.6, gemini-2.5-flash, qwen3-235b. Each judge receives the original prompt, the ideal answer, the scoring criteria, and any auto-check flags. It returns a score and a short rationale.
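In outline, the per-prompt judging step looks something like the sketch below; `ask_judge` stands in for the actual API call and is purely illustrative.

```python
JUDGES = ["gpt-4.1", "claude-sonnet-4.6", "gemini-2.5-flash", "qwen3-235b"]

def judge_response(prompt, ideal_answer, criteria, flags, response, ask_judge):
    """Collect a 1-5 score and short rationale from each judge.

    Each judge sees the original prompt, the ideal answer, the scoring
    criteria, and any auto-check flags alongside the model's response.
    """
    verdicts = [
        ask_judge(judge, prompt=prompt, ideal=ideal_answer,
                  criteria=criteria, flags=flags, response=response)
        for judge in JUDGES
    ]
    avg_score = sum(v["score"] for v in verdicts) / len(verdicts)
    return avg_score, verdicts
```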
In addition to the multi-judge scores, each response is scored by DeepEval using G-Eval metrics - research-backed LLM evaluation criteria that provide multi-dimensional scoring on a 0 to 1 scale.
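A G-Eval metric in DeepEval is defined roughly like this; the criteria text below is illustrative, not the benchmark's exact wording:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Judge whether the actual output is factually consistent "
             "with the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT,
                       LLMTestCaseParams.EXPECTED_OUTPUT],
)

case = LLMTestCase(input="<prompt>", actual_output="<model response>",
                   expected_output="<ideal answer>")
correctness.measure(case)
print(correctness.score)  # a float in [0, 1]
```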
The composite is the headline number on the Generalist leaderboard. It blends two signals into one, displayed as 0 to 100. The dashboard computes everything internally on a 0 to 1 scale, then multiplies by 100 for display:

composite = 100 × (w_judge × normalized_judge + w_deepeval × deepeval_avg)

where normalized_judge = (avg_judge_score - 1) / 4 rescales the 1 to 5 judge average into 0 to 1, deepeval_avg is the mean of the correctness, coherence, and instruction-following metrics, and w_judge and w_deepeval are the blend weights. Only judges with complete coverage (scored every prompt for that model) contribute to the average; partial-coverage judges are excluded entirely to avoid biased subsets.
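Concretely, a sketch of the computation, with the blend weights left as parameters since the exact values are internal to the dashboard:

```python
def composite(judge_scores, deepeval_metrics, w_judge, w_deepeval):
    """judge_scores: per-judge 1-5 averages, complete-coverage judges only.
    deepeval_metrics: correctness, coherence, instruction-following, each 0-1.
    """
    avg_judge = sum(judge_scores) / len(judge_scores)
    normalized_judge = (avg_judge - 1) / 4                        # 1-5 -> 0-1
    deepeval_avg = sum(deepeval_metrics) / len(deepeval_metrics)  # already 0-1
    return 100 * (w_judge * normalized_judge + w_deepeval * deepeval_avg)
```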
The Causal benchmark uses a different scoring scheme (deterministic multiple choice, no judges, no DeepEval). The two scores are reported side by side on the leaderboard but never blended.
The efficiency score balances quality against verbosity:
efficiency = avg_score / log2(avg_tokens).
This rewards models that achieve high scores without padding responses with unnecessary tokens.
A concise, correct answer scores higher than an equally correct but bloated one.
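For a concrete sense of the scale, assuming avg_score is the mean judge score (numbers are illustrative):

```python
import math

def efficiency(avg_score: float, avg_tokens: float) -> float:
    return avg_score / math.log2(avg_tokens)

efficiency(4.2, 800)   # ~0.44
efficiency(4.2, 3000)  # ~0.36 - same quality, penalized for padding
```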
Reasoning-capable models (gpt-5.x, o3-mini, o4-mini, Gemini Pro reasoning) spend output tokens on hidden chain-of-thought before emitting any visible answer. When the response budget runs out before the model writes its final answer, the API returns success but with empty text. The dashboard counts these as Invalid rather than Errors, so a token-budget failure is distinguishable from an API failure. This matters when comparing models: an "invalid" rate on a reasoning model is a real capability signal, not a network problem.
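In pseudocode, the classification is roughly the following; the field names are illustrative, not the dashboard's actual response schema:

```python
def classify_result(api_error, response_text: str) -> str:
    """Separate token-budget failures from genuine API failures."""
    if api_error is not None:
        return "error"    # transport or provider failure
    if not response_text.strip():
        return "invalid"  # success status, but hidden reasoning consumed
                          # the budget and no visible answer was emitted
    return "ok"
```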
| Category | Prompts | Subcategories | What It Tests |
|---|---|---|---|
| Behavioural | 12 | appropriate refusal, hallucination, sycophancy, unsolicited opinions, verbosity | Sycophancy resistance, hallucination detection, appropriate refusal, verbosity control, and unsolicited opinion avoidance. Tests character and safety alignment. |
| Coding | 15 | algorithm reasoning, architecture, bug detection, code generation, code review, concurrency, cross language, debugging, debugging reasoning, ml implementation, performance, refactoring, security, testing, vague spec | Bug detection (including trap prompts with no bug), code generation, debugging, architecture design, security review, refactoring, concurrency, ML implementation, and cross-language tasks. Medium to hard difficulty. |
| Instruction Following | 8 | ambiguity handling, conflicting constraints, creative constraint, exact format, format compliance, multi constraint, multi step, refusal calibration | Exact format compliance, multi-constraint tasks, conflicting instructions, creative constraints, and ambiguity handling. Tests literal instruction adherence. |
| Learning | 12 | calibration, comparison, concept explanation, emerging, factual, factual accuracy, methodology, nuanced explanation, practical, practical advice, trap | Technical explanations, factual accuracy, nuanced comparisons, calibration, and trap questions testing common misconceptions. Tests depth of understanding vs surface-level answers. |
| Meta | 5 | calibration, honesty under pressure, self knowledge, trap, uncertainty | Self-knowledge, calibration, honesty under pressure, and uncertainty expression. Tests whether models know what they don't know. |
| Reasoning | 12 | causal reasoning, estimation, ethical tradeoff, evidence evaluation, expected value, false premise, logic, math with distractors, software tradeoffs, statistics, tradeoff analysis | Fermi estimation, logic puzzles, statistical analysis, ethical tradeoffs, causal reasoning, and false premise detection. Tests whether models show their work and catch tricks. |
| Research | 6 | comparison, contradictory sources, crash course, summarization fidelity, synthesis, technical evaluation | Source synthesis, contradictory evidence handling, technical evaluation, and summarization fidelity. Tests analytical depth over breadth. |
| Writing | 10 | anti slop, argumentation, constraint following, documentation, editing, email drafting, structured, technical writing, tone switching | Technical writing, tone switching, anti-slop detection, constrained writing, editing, email drafting, and argumentation. Tests natural voice and format compliance. |
| Difficulty | Prompts |
|---|---|
| easy | 10 |
| medium | 36 |
| hard | 34 |
Exact prompts are not published to prevent models from being tuned to this specific benchmark. Categories, evaluation criteria, and scoring methodology are fully documented above. Each prompt is scored by automated checks where applicable, plus multi-judge LLM scoring for nuanced evaluation.
The causal benchmark is a separate 100-question suite focused on causal inference. Twenty concept bundles (confounding, colliders, mediators, selection, time-varying confounding, transportability, etc.) each have five variants that test the same underlying concept from different angles. All questions are multiple choice with deterministic scoring - no LLM judge or DeepEval involvement.
Current version: 2.4. Questions are not published, to prevent models from being tuned to this specific benchmark.
Live leaderboard at Causal Reasoning →.
| Variant | What it tests |
|---|---|
| Base | Narrative scenario combining 2-3 interacting causal issues (confounding + selection, mediator + attrition, etc.) |
| Trap | Looks like the base concept applies but the obvious answer is wrong; tests when a principle does NOT apply |
| Transfer | Formal DAG reasoning with short elimination-style options (set notation, path counts, yes/no with reason) |
| Numeric | Multi-step calculation with tables and conditional probabilities; can't be answered by intuition alone |
| Analyst | Two analysts debate the same scenario; identify which assessment is most accurate |
Every question has one correct letter, so the causal benchmark is purely deterministic: no LLM judges, no DeepEval. Just correct or incorrect, counted out of 100.
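Scoring therefore reduces to exact letter matching, roughly as below (structure illustrative):

```python
def score_causal(predictions: dict, answer_key: dict) -> int:
    """Count exact letter matches across the 100-question suite."""
    return sum(predictions.get(qid) == letter
               for qid, letter in answer_key.items())
```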
A handful of models are excluded from this benchmark because they cannot be fairly evaluated here (retired APIs, paid-tier-only providers, broken model paths). They still appear on the Generalist leaderboard where they ran cleanly.
The current version is the result of four structural-hardening iterations against a cheap baseline (Claude Haiku 3). Each round discovered a new way the benchmark could be gamed without causal reasoning.
Full design document: docs/plans/2026-04-10-causal-benchmark-v2-harder.md.