Two benchmarks run side by side. Generalist (80 prompts, 8 categories) tests breadth. Causal (100 multiple-choice questions, 5 variants) tests narrow causal-inference reasoning. Both are displayed on a 0 to 100 scale. They are never blended.
Generalist is scored by three layers. Auto-checks flag mechanical failures (format, hallucination, sycophancy). Four LLM judges score 1 to 5. DeepEval rates correctness, coherence, and instruction-following 0 to 1. Composite blends judge and DeepEval.
Causal is deterministic. Multiple choice. No judges, no DeepEval. Accuracy = correct ÷ valid. Errors (API failures) and Invalid (empty/unextractable response) are reported separately.
Self-judging is prevented. A judge LLM never scores responses from its own family (gpt-4.1 does not judge gpt-4o, etc.). See the Judge Audit for divergence and agreement evidence.
Some models are excluded. Retired APIs, paid-tier-only providers, and broken model paths are excluded from causal so the leaderboard isn't padded with zeros. They still appear on the Generalist board where they ran cleanly.
Generalist Benchmark
Focus
This evaluation measures what matters for practical, day-to-day use of LLMs as a working tool.
It is not a general knowledge benchmark or a trivia test. The prompt set is designed around
tasks a developer, researcher, or technical writer would actually ask an LLM to do, with
emphasis on scenarios where models commonly fail or diverge.
What we test for
Accuracy under pressure - trap questions, false premises, phantom bugs, and wrong claims that tempt sycophantic agreement
Honest calibration - does the model hedge when uncertain, refuse when appropriate, and acknowledge its own limitations?
Instruction following - exact format compliance, word count targets, constraint adherence, and banned word avoidance
Reasoning depth - multi-step problems, causal reasoning, estimation, and the ability to show work rather than guess
Practical coding - real debugging scenarios, architecture decisions, code review, and implementation - not leetcode
Writing quality - tone control, concision, editing skill, and the ability to adapt style to audience
What we deliberately avoid
Trivia and memorization (Wikipedia knowledge is cheap)
Simple Q&A that any model can pass
Prompts with only one valid answer format
Benchmarks that reward verbosity over substance
Evaluation Pipeline
Each model runs through the same pipeline for every prompt:
Prompt sent to model → Response collected with latency/token counts →
Automated checks run → LLM judge scores 1-5 with rationale →
DeepEval G-Eval metrics (correctness, coherence, instruction following) →
Composite score computed (weighted merge of judge + DeepEval) →
Results persisted as JSON
All models receive identical prompts with temperature: 0 for reproducibility
No system prompts are injected - models receive only the raw user prompt
Each prompt has a defined ideal answer and scoring criteria that the judge evaluates against
Results are append-only - re-running a model adds a new entry, preserving history
Check-type breakdown (Layers 1 and 2)
Tables of all auto-check and judge-only check types.
Auto-Checks (Layer 1)
Deterministic, heuristic checks. Run instantly on every response and flag mechanical failures.
| Check Type | Prompts |
| --- | --- |
| acknowledges nonexistence | 1 |
| ambiguity check | 2 |
| banned words | 3 |
| code runnable | 5 |
| constraint check | 2 |
| hallucination api | 1 |
| json valid | 1 |
| multi step verify | 3 |
| refusal check | 3 |
| response length | 3 |
| self awareness | 1 |
| statistical significance | 1 |
| sycophancy check | 5 |
| table format | 1 |
| trap common error | 1 |
| trap no bug | 1 |
| trap wrong claim | 1 |
| word count | 2 |
| word count reduction | 1 |
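To make the idea of a deterministic auto-check concrete, here is a minimal sketch of three of the check types listed above (json valid, banned words, word count). The function names, signatures, and the tolerance parameter are illustrative assumptions; the dashboard's actual heuristics are not published on this page.

```python
import json
import re


def check_json_valid(response: str) -> bool:
    """Pass if the whole response parses as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def check_banned_words(response: str, banned: list[str]) -> list[str]:
    """Return the banned words that appear as whole words (case-insensitive)."""
    found = []
    for word in banned:
        if re.search(rf"\b{re.escape(word)}\b", response, flags=re.IGNORECASE):
            found.append(word)
    return found


def check_word_count(response: str, target: int, tolerance: int = 0) -> bool:
    """Pass if the whitespace-delimited word count is within tolerance of target."""
    return abs(len(response.split()) - target) <= tolerance
```

Checks like these return the same result every time for the same response, which is what lets them run on every response before any judge is called.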
Judge-Only (Layer 2)
These check types have no automated heuristic. The LLM judge scores them entirely on quality and reasoning.
| Check Type | Prompts |
| --- | --- |
| analysis | 1 |
| behavioural | 3 |
| calibration | 2 |
| checklist | 2 |
| comparison | 3 |
| format check | 3 |
| reasoning | 26 |
| synthesis | 2 |
Multi-Judge Scoring
Each model response is scored by multiple independent LLM judges, each rating it 1 to 5. The current judges are gpt-4.1, claude-sonnet-4.6, gemini-2.5-flash, and qwen3-235b.
Each judge receives the original prompt, the ideal answer, the scoring criteria, and any
auto-check flags. It returns a score and a short rationale.
Averaging rules
A judge's scores only count toward the average if it has scored every scorable prompt for that model - partial coverage is excluded entirely
The displayed judge score is the mean of each qualifying judge's global average (equal weight per judge)
Self-judging is prevented - a judge model does not score its own responses (e.g. gpt-4.1 does not judge gpt-4.1)
Click any row on the leaderboard to see per-judge score breakdowns
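The averaging rules above can be sketched as follows. This is an illustration under assumptions, not the dashboard's code: the function name and the nested-dict input shape are hypothetical, and the judge and prompt names in the usage example are placeholders.

```python
from statistics import mean


def displayed_judge_score(scores, scorable_prompts):
    """Mean of per-judge averages for one model, on the 1-5 scale.

    scores: {judge_name: {prompt_id: score}} (hypothetical input shape).
    A judge counts only if it scored every scorable prompt.
    Returns None if no judge qualifies.
    """
    required = set(scorable_prompts)
    if not required:
        return None
    per_judge_avgs = []
    for judge, by_prompt in scores.items():
        if not required.issubset(by_prompt):
            continue  # partial coverage: drop this judge entirely
        per_judge_avgs.append(mean(by_prompt[p] for p in required))
    # Equal weight per judge: average the averages, not the pooled scores.
    return mean(per_judge_avgs) if per_judge_avgs else None


scores = {
    "judge_a": {"p1": 4, "p2": 2},  # average 3
    "judge_b": {"p1": 5, "p2": 5},  # average 5
    "judge_c": {"p1": 1},           # missing p2, excluded
}
print(displayed_judge_score(scores, ["p1", "p2"]))  # 4
```

Dropping judge_c entirely, rather than averaging its single score in, keeps a judge that only saw a subset of prompts from skewing the result.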
5 Excellent - fully addresses the prompt, accurate, well-structured, meets all criteria
4 Good - mostly correct with minor gaps or style issues
3 Adequate - partially addresses the prompt, some errors or missing elements
2 Poor - significant errors, missing key requirements, or off-topic
1 Failing - wrong, harmful, empty, or completely misses the point
Judge guidelines
Hallucinated facts, fabricated references, and confident wrong answers are penalised
Appropriate hedging, asking for clarification, and refusing harmful requests are rewarded
Auto-check flag failures lower the score
A 3 is average, 5 is genuinely excellent - the scale is strict but fair
DeepEval G-Eval Scoring (Layer 3)
In addition to the multi-judge scores, each response is scored by DeepEval using
G-Eval metrics - research-backed LLM evaluation criteria that provide
multi-dimensional scoring on a 0-1 scale.
Metrics
Correctness - Is the response factually correct compared to the expected output? Penalises contradictions, omissions, and hallucinations.
Coherence - Does the response have clear logical flow, good structure, and present ideas without contradictions?
Instruction Following - Does the response address all parts of the prompt and adhere to format, length, and constraint requirements?
How it works
Each metric uses a chain-of-thought evaluation via the same judge model
DeepEval's native scores are 0 to 1; the dashboard displays them as 0 to 100 for parity with the other panels
DeepEval supplements the LLM judge rather than replacing it. Both signals appear separately on the leaderboard so you can see when they agree.
Composite Score (Generalist)
The composite is the headline number on the Generalist leaderboard. It blends two signals
into one, displayed as 0 to 100. The dashboard computes everything internally on a
0 to 1 scale, then multiplies by 100 for display.
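composite = w * normalized_judge + (1 - w) * deepeval_avg, with w = 0.5 by default (an equal 50/50 split),
where normalized_judge = (avg_judge_score - 1) / 4 rescales the 1 to 5 judge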
average into 0 to 1, and deepeval_avg is the mean of correctness, coherence,
and instruction-following metrics. Only judges with complete coverage (scored every
prompt for that model) contribute to the average; partial-coverage judges are excluded
entirely to avoid biased subsets.
Fallback behavior
Both scores available - weighted average (default 50/50)
Only judge score - composite = normalized judge
Only DeepEval - composite = DeepEval average
Neither - no composite score
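The blend and its fallback rules can be sketched in a few lines. The function and parameter names here are assumptions made for illustration; only the normalization, the default 50/50 weighting, and the four fallback cases come from the description above.

```python
def composite_score(avg_judge_score=None, deepeval_avg=None, judge_weight=0.5):
    """Return the composite on a 0-100 scale, or None if no signal exists.

    avg_judge_score is on the 1-5 judge scale; deepeval_avg is on 0-1.
    judge_weight is a hypothetical parameter name for the judge share.
    """
    normalized_judge = None
    if avg_judge_score is not None:
        normalized_judge = (avg_judge_score - 1) / 4  # rescale 1-5 to 0-1

    if normalized_judge is not None and deepeval_avg is not None:
        value = judge_weight * normalized_judge + (1 - judge_weight) * deepeval_avg
    elif normalized_judge is not None:
        value = normalized_judge  # only the judge score is available
    elif deepeval_avg is not None:
        value = deepeval_avg  # only DeepEval is available
    else:
        return None  # neither signal: no composite
    return value * 100  # computed on 0-1, displayed as 0-100


print(composite_score(3.0))  # 50.0 (judge only: (3 - 1) / 4 = 0.5)
print(composite_score())     # None
```

For example, a judge average of 4.0 normalizes to 0.75; blended equally with a DeepEval average of 0.80 it gives roughly 77.5 on the displayed scale.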
The Causal benchmark uses a different scoring scheme (deterministic multiple
choice, no judges, no DeepEval). The two scores are reported side by side on the leaderboard
but never blended.
Efficiency Metric
The efficiency score balances quality against verbosity:
efficiency = avg_score / log2(avg_tokens).
This rewards models that achieve high scores without padding responses with unnecessary tokens.
A concise, correct answer scores higher than an equally correct but bloated one.
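A small worked example shows how the logarithm behaves. The scale of avg_score is not stated here, so the value 80 below is an arbitrary illustrative number, and the guard against token counts of 1 or less is an added assumption to keep the divisor positive.

```python
import math


def efficiency(avg_score: float, avg_tokens: float) -> float:
    """efficiency = avg_score / log2(avg_tokens)."""
    if avg_tokens <= 1:
        raise ValueError("avg_tokens must be greater than 1")
    return avg_score / math.log2(avg_tokens)


# Same quality, four times the tokens:
print(efficiency(80, 256))   # 10.0  (log2(256) = 8)
print(efficiency(80, 1024))  # 8.0   (log2(1024) = 10)
```

Because the penalty is logarithmic, quadrupling the response length costs 20% of the efficiency score in this example rather than 75%, so moderate extra detail is penalised only mildly.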
Reasoning Models
Reasoning-capable models (gpt-5.x, o3-mini, o4-mini, Gemini Pro reasoning) spend output
tokens on hidden chain-of-thought before emitting any visible answer. When the response
budget runs out before the model writes its final answer, the API returns success but
with empty text. The dashboard counts these as Invalid rather than
Errors, so a token-budget failure is distinguishable from an API failure.
This matters when comparing models: an "invalid" rate on a reasoning model is a real
capability signal, not a network problem.
Refusals
Some models return empty or near-empty responses on prompts they consider sensitive
(security topics, network scanning, fictitious libraries). These are counted as
Refusals and excluded from quality scores (Judge, DeepEval, Composite).
This prevents safety-layer behavior from artificially deflating a model's quality rating.
The refusal count is shown separately so you can see both how good a model's answers
are and how often it declines to answer.
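As a rough sketch of the exclusion rule, assuming each result already carries a boolean refusal flag (how the dashboard detects a refusal is not specified here, and the field names are hypothetical):

```python
def quality_summary(results):
    """Average quality over answered prompts, with refusals counted separately.

    results: list of dicts with 'refusal' (bool) and, for answered
    prompts, 'score' (float). Field names are illustrative assumptions.
    """
    answered = [r["score"] for r in results if not r["refusal"]]
    return {
        "quality_avg": sum(answered) / len(answered) if answered else None,
        "refusals": sum(1 for r in results if r["refusal"]),
    }


results = [
    {"score": 4.0, "refusal": False},
    {"score": 5.0, "refusal": False},
    {"score": None, "refusal": True},  # declined to answer
]
print(quality_summary(results))  # {'quality_avg': 4.5, 'refusals': 1}
```

Had the refusal been scored as a failing 1 instead, the average in this example would fall from 4.5 to about 3.3, which is the deflation the separate count avoids.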
| Category | Prompts | Subcategories | Description |
| --- | --- | --- | --- |
| | | | Bug detection (including trap prompts with no bug), code generation, debugging, architecture design, security review, refactoring, concurrency, ML implementation, and cross-language tasks. Medium to hard difficulty. |
| Instruction Following | 8 | ambiguity handling, conflicting constraints, creative constraint, exact format, format compliance, multi constraint, multi step, refusal calibration | Exact format compliance, multi-constraint tasks, conflicting instructions, creative constraints, and ambiguity handling. Tests literal instruction adherence. |
| | | | Technical explanations, factual accuracy, nuanced comparisons, calibration, and trap questions testing common misconceptions. Tests depth of understanding vs surface-level answers. |
| Meta | 5 | calibration, honesty under pressure, self knowledge, trap, uncertainty | Self-knowledge, calibration, honesty under pressure, and uncertainty expression. Tests whether models know what they don't know. |
| Reasoning | 12 | causal reasoning, estimation, ethical tradeoff, evidence evaluation, expected value, false premise, logic, math with distractors, software tradeoffs, statistics, tradeoff analysis | Fermi estimation, logic puzzles, statistical analysis, ethical tradeoffs, causal reasoning, and false premise detection. Tests whether models show their work and catch tricks. |
| | | | Source synthesis, contradictory evidence handling, technical evaluation, and summarization fidelity. Tests analytical depth over breadth. |
| Writing | 10 | anti slop, argumentation, constraint following, documentation, editing, email drafting, structured, technical writing, tone switching | Technical writing, tone switching, anti-slop detection, constrained writing, editing, email drafting, and argumentation. Tests natural voice and format compliance. |
Difficulty Distribution
| Difficulty | Prompts |
| --- | --- |
| easy | 10 |
| medium | 36 |
| hard | 34 |
Benchmark Integrity
Exact prompts are not published to prevent models from being tuned to this specific benchmark.
Categories, evaluation criteria, and scoring methodology are fully documented above.
Each prompt is scored by automated checks where applicable, plus multi-judge LLM scoring for nuanced evaluation.
Causal Reasoning Benchmark
Overview
The causal benchmark is a separate 100-question suite focused on causal inference.
Twenty concept bundles (confounding, colliders, mediators, selection, time-varying
confounding, transportability, etc.) each have five variants that test the same
underlying concept from different angles. All questions are multiple choice with
deterministic scoring - no LLM judge or DeepEval involvement.
Current version: 2.4. Questions are not published, to prevent models being tuned to this specific benchmark.
The live leaderboard is on the Causal Reasoning page.
Looks like the base concept applies but the obvious answer is wrong; tests when a principle does NOT apply
Transfer - Formal DAG reasoning with short elimination-style options (set notation, path counts, yes/no with reason)
Numeric - Multi-step calculation with tables and conditional probabilities; can't be answered by intuition alone
Analyst - Two analysts debate the same scenario; identify which assessment is most accurate
Scoring
Every question has one correct letter, so the causal benchmark is purely deterministic:
no LLM judges, no DeepEval. Just correct or incorrect, counted out of 100.
A handful of models are excluded from this benchmark because they cannot be fairly evaluated
here (retired APIs, paid-tier-only providers, broken model paths). They still appear on the
Generalist leaderboard where they ran cleanly.
What the dashboard shows
Accuracy - correct ÷ valid responses, excluding errors and invalid
Score - correct out of 100, the absolute count regardless of failures
Errors - API failures (rate limits, timeouts, server errors). An operational issue, not a capability signal.
Invalid - the model returned successfully but with no extractable answer. Common with reasoning models that exhaust their token budget on hidden chain-of-thought before emitting text. A capability signal.
Per-variant accuracy - which of the five reasoning angles the model handles best and worst
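The relationship between these figures can be sketched as below. The outcome labels and the function name are assumptions for illustration; the arithmetic follows the definitions in the list above.

```python
from collections import Counter


def summarize_causal(outcomes):
    """Summarize one model's run.

    outcomes: one label per question, each of 'correct', 'incorrect',
    'error' (API failure) or 'invalid' (no extractable answer).
    These labels are illustrative, not the dashboard's own field values.
    """
    counts = Counter(outcomes)
    valid = counts["correct"] + counts["incorrect"]
    return {
        # Accuracy ignores errors and invalid responses.
        "accuracy": counts["correct"] / valid * 100 if valid else None,
        # Score is the absolute count of correct answers.
        "score": counts["correct"],
        "errors": counts["error"],
        "invalid": counts["invalid"],
    }


run = ["correct"] * 60 + ["incorrect"] * 20 + ["error"] * 5 + ["invalid"] * 15
print(summarize_causal(run))
# {'accuracy': 75.0, 'score': 60, 'errors': 5, 'invalid': 15}
```

In this made-up run the model is right on 75% of the questions it actually answered but scores only 60 out of 100, which is why Accuracy and Score are shown as separate columns.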
Hardening history (v2.0 to v2.4)
How the benchmark was iteratively hardened against gaming.
The current version is the result of four structural-hardening iterations against
a cheap baseline (Claude Haiku 3). Each round discovered a new way the benchmark
could be gamed without causal reasoning.
v2.0 - initial 100-question release. Haiku 74%. Opus 4.7 at 82%. Low differentiation.
v2.1 - content hardening (sophisticated distractors). Haiku only moved to 70%. Investigation revealed a structural tell.
v2.2 - length normalization on transfer + analyst. "Always pick longest option" attack rate dropped from 71% to 57%, but Haiku still hit 90% on transfer via a semantic tell.
v2.3 - narrative transfer replaced with paragraph-long DAG questions. Opus saturated at 100%, frontier differentiation lost.
v2.4 - transfer rewritten with elimination-style short options. Haiku 40% / Opus 55% on transfer, no saturation at either end, length bias eliminated.
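One way to measure a structural tell like the "always pick longest option" attack mentioned above is to score that strategy as if it were a model. The sketch below assumes the attack rate is the accuracy of that strategy, and the question format (an options dict plus an answer letter) is hypothetical, since the real questions are not published.

```python
def longest_option_attack_rate(questions):
    """Fraction of questions answered correctly by always picking the longest option.

    questions: non-empty list of dicts with 'options' ({letter: text})
    and 'answer' (the correct letter). On a length tie, the first
    longest option in dict order is chosen.
    """
    hits = 0
    for q in questions:
        options = q["options"]
        longest = max(options, key=lambda letter: len(options[letter]))
        hits += longest == q["answer"]
    return hits / len(questions)


sample = [
    {"options": {"A": "short", "B": "a much longer option"}, "answer": "B"},
    {"options": {"A": "the longest option here", "B": "tiny"}, "answer": "B"},
]
print(longest_option_attack_rate(sample))  # 0.5
```

On a four-option question set with no length bias, this strategy should land near the 25% chance level; a rate far above that indicates the correct answers can be spotted by length alone.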