One model score, many case sets.

The overall score comes from calibrated practical work, while each case set keeps its own outputs, checks, and failure types visible.

3 Live benchmarks
0 Planned benchmarks
52 Total runs
Scanned paperwork

The Paperwork Trial

Synthetic invoice PNG scans plus bank exports, vendor records, purchase orders, and exact audit-result oracles.

8 runs, 8 models
generated invoice images, CSV cross-checks, evidence fields, proof code oracle

Proof model: visible checks are available during the run, but the final score is decided only after the run finishes, by protected-file checks plus hidden oracles.

The leaderboard ranks this benchmark by Practical Score: half resolved pass@1, half core-oracle pass. Common checks and runner-specific workflow checks remain visible as diagnostics.
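As a rough illustration of the scoring rule described above, the sketch below assumes that "half resolved pass@1, half core-oracle pass" means the unweighted mean of the two pass rates over the same number of tasks. The function name `practical_score` and its parameters are hypothetical, not taken from the benchmark's code; the reading does reproduce the Practical column in the tables on this page.

```python
def practical_score(resolved: int, core: int, tasks: int) -> float:
    """Return a Practical Score as a percentage.

    Assumption: the score is the plain average of the resolved pass@1
    rate and the core-oracle pass rate, both counted over `tasks` tasks.
    """
    if tasks <= 0:
        raise ValueError("tasks must be positive")
    if not (0 <= resolved <= tasks and 0 <= core <= tasks):
        raise ValueError("counts must lie between 0 and tasks")
    # (resolved/tasks + core/tasks) / 2, written with a single division
    return 100 * (resolved + core) / (2 * tasks)


# Counts copied from leaderboard rows on this page.
print(practical_score(resolved=4, core=4, tasks=5))  # 4/5 resolved, 4/5 core
print(practical_score(resolved=3, core=4, tasks=5))  # 3/5 resolved, 4/5 core
print(practical_score(resolved=0, core=3, tasks=4))  # 0/4 resolved, 3/4 core
```

Under this reading, a model can earn partial credit from core-oracle passes even when no task is fully resolved, which is how a row with 0/4 resolved can still show a nonzero Practical Score.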

| Model | Type | Practical | Resolved | Near miss | Core | Visible | Failure types | Common checks | Run |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gemma-4-26b-a4b | local | 80.0% | 4/5 (80%) | 0/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, evidence_path_format, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 18/20 (90%) | details |
| codex-default | reference | 70.0% | 3/5 (60%) | 1/5 | 4/5 (80%) | 5/5 | ignored_document_id_error, proof_code_error | 17/20 (85%) | details |
| qwen3.6-27b | local | 70.0% | 3/5 (60%) | 1/5 | 4/5 (80%) | 5/5 | duplicate_risk_missed, evidence_path_format, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 17/20 (85%) | details |
| gemma-4-e4b | local | 50.0% | 2/5 (40%) | 1/5 | 3/5 (60%) | 4/5 | duplicate_risk_missed, evidence_path_format, ignored_document_id_error, invoice_classification_error, invoice_id_format_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 14/20 (70%) | details |
| qwen3.6-35b-a3b | local | 40.0% | 1/5 (20%) | 2/5 | 3/5 (60%) | 3/5 | evidence_path_format | 10/20 (50%) | details |
| gemma-4-31b-it | local | 20.0% | 0/5 (0%) | 2/5 | 2/5 (40%) | 5/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 12/20 (60%) | details |
| gemma-4-e2b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 5/5 | duplicate_risk_missed, evidence_path_format, ignored_document_id_error, invoice_classification_error, missing_or_wrong_evidence, proof_code_error, total_calculation_error, warning_code_error | 10/20 (50%) | details |
| ministral-3-3b | local | 0.0% | 0/5 (0%) | 0/5 | 0/5 (0%) | 0/5 | duplicate_risk_missed, invoice_classification_error, proof_code_error, total_calculation_error, warning_code_error | 1/20 (5%) | details |
Agentic paperwork folders

Paperwork Workflow

Synthetic messy intake and email-attachment workflows with generated scans, protected sources, normalized artifacts, payment remapping, and hidden oracles.

32 runs, 8 models
source selection, generated images, protected folder, proof.txt oracle

Proof model: visible checks are available during the run, but the final score is decided only after the run finishes, by protected-file checks plus hidden oracles.

The leaderboard ranks this benchmark by Practical Score: half resolved pass@1, half core-oracle pass. Common checks and runner-specific workflow checks remain visible as diagnostics.
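The page does not say how the protected-file checks in the proof model are implemented. One plausible approach, shown here purely as an assumption, is to hash every file in the protected folder before the run and compare the digests after the run finishes. The helper names `snapshot` and `changed_paths` are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to folder) to its SHA-256 digest."""
    return {
        p.relative_to(folder).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_paths(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return paths that were added, removed, or modified between snapshots."""
    all_paths = set(before) | set(after)
    return sorted(p for p in all_paths if before.get(p) != after.get(p))
```

A runner following this sketch would call `snapshot` on the protected folder before handing control to the model, call it again afterwards, and treat any non-empty result from `changed_paths` as a failed protected-file check.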

| Model | Type | Practical | Resolved | Near miss | Core | Visible | Failure types | Common checks | Run |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| codex-default | reference | 100.0% | 4/4 (100%) | 0/4 | 4/4 (100%) | 4/4 | none | 16/16 (100%) | details (best of 4) |
| qwen3.6-27b | local | 75.0% | 2/4 (50%) | 2/4 | 4/4 (100%) | 4/4 | final_document_set_error | 14/16 (88%) | details (best of 4) |
| gemma-4-26b-a4b | local | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | final_document_set_error, manifest_error, missing_or_wrong_evidence, normalized_text_error, proof_code_error, proof_txt_error, warning_code_error | 11/16 (69%) | details (best of 4) |
| gemma-4-31b-it | local | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | final_document_set_error, normalized_text_error, proof_code_error, proof_txt_error, warning_code_error | 11/16 (69%) | details (best of 4) |
| qwen3.6-35b-a3b | local | 37.5% | 0/4 (0%) | 3/4 | 3/4 (75%) | 4/4 | audit_result_wrong_location, manifest_error, missing_or_wrong_evidence | 10/16 (63%) | details (best of 4) |
| gemma-4-e2b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 4/4 | document_index_error, final_document_set_error, format_failure, manifest_error, missing_or_wrong_evidence, no_output, normalized_text_error, payment_reconciliation_error, proof_code_error, proof_txt_error, total_calculation_error, warning_code_error | 5/16 (31%) | details (best of 4) |
| gemma-4-e4b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 3/4 | attachment_index_error, document_index_error, final_document_set_error, format_failure, manifest_error, missing_or_wrong_evidence, normalized_text_error, proof_code_error, proof_txt_error, required_artifact_missing, total_calculation_error, warning_code_error | 5/16 (31%) | details (best of 4) |
| ministral-3-3b | local | 0.0% | 0/4 (0%) | 0/4 | 0/4 (0%) | 0/4 | attachment_index_error, document_index_error, final_document_set_error, manifest_error, no_output, normalized_text_error, payment_reconciliation_error, proof_txt_error, required_artifact_missing | 0/16 (0%) | details (best of 4) |
Constrained SVG visual

City Plan SVG

A city-plan SVG prompt with roads, blocks, and 3D or isometric buildings. Valid vector output, no Markdown excuses.

12 runs, 9 models
valid SVG, city-plan constraints, shareable artifact

Visual sample: this is one constrained SVG prompt, not a statistical benchmark. It is shown as pass/review/fail with checks and the generated artifact for manual inspection.

A pass only means the output met the automated SVG and constraint checks. Visual quality still needs a human look.
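The three automated checks are not spelled out on this page. As a minimal sketch of what a "valid SVG" check and the pass/review/fail labelling might look like, the code below assumes that validity means well-formed XML with an `<svg>` root, and that the label follows the pattern visible in the table: no SVG output is a fail, all checks passing is a pass, anything else goes to review. Both `is_valid_svg` and `classify` are hypothetical names.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def is_valid_svg(text: str) -> bool:
    """True if text parses as XML and its root element is <svg>.

    Output wrapped in a Markdown code fence or prefixed with prose is
    rejected because it is not well-formed XML.
    """
    try:
        root = ET.fromstring(text.strip())
    except ET.ParseError:
        return False
    return root.tag in ("svg", SVG_NS + "svg")


def classify(has_svg: bool, passed: int, total: int) -> str:
    """Assumed mapping from check results to the Result column."""
    if not has_svg:
        return "fail"
    return "pass" if passed == total else "review"


sample = '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10"><rect width="10" height="10"/></svg>'
print(is_valid_svg(sample), classify(True, 3, 3))
```

This only covers syntax. City-plan constraints such as roads, blocks, and isometric buildings would need separate checks, and, as noted above, visual quality still needs a human look.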

| Model | Type | Result | Checks | SVG preview | Run |
| --- | --- | --- | --- | --- | --- |
| gemma-4-31b-it | local | pass | 3/3 | 1 SVG | details |
| codex-default | reference | pass | 3/3 | 1 SVG | details |
| gpt-oss-20b:free | reference | pass | 3/3 | 1 SVG | details |
| gemma-4-e4b | local | pass | 3/3 | 1 SVG | details |
| ministral-3-3b | local | review | 2/3 | 1 SVG | details |
| gemma-4-e2b | local | review | 2/3 | 1 SVG | details |
| qwen3.6-35b-a3b | local | fail | 2/3 | No SVG output | details (best of 2) |
| qwen3.6-27b | local | fail | 1/3 | No SVG output | details (best of 2) |
| gemma-4-26b-a4b | local | fail | 1/3 | No SVG output | details (best of 2) |