qwen3.6-35b-a3b
2026-05-14T19-29-15Z_qwen_qwen3.6-35b-a3b_paperwork_workflow_case04
50% Practical score
0/1 Resolved tasks
1/1 Core tasks
Verdict
practical run result near miss
Core 1/1 · Visible 1/1 · Hidden 0/1
manifest_error
Prompt set v3_paperwork_workflow_case04.json
Run type local
Temperature 0
Top P 1
Missed tasks 1/1
Task Results
| Task | Category | Score | Failed Checks | Failure Types | Prompt / Answer |
|---|---|---|---|---|---|
| messy_intake_workflow_case_04 | paperwork | 3/4 | hidden_oracle_pass | manifest_error | Prompt
You are in a synthetic paperwork benchmark workspace. Audit the messy intake folder in `incoming/`.
Create these final artifacts:
- `work/normalized_manifest.json`
- `work/document_index.json`
- `work/normalized/INV-9107.txt`
- `work/normalized/INV-9108.txt`
- `work/normalized/INV-9109.txt`
- `audit_result.json`
- `proof.txt`
Rules:
- Identify the active source files.
- Ignore drafts, old exports, duplicate vendor files, and non-invoice docs.
- Do not modify anything in `incoming/`.
- Return no advice. This is benchmark bookkeeping, not financial, tax, or legal advice.
- `proof.txt` must contain only the final numeric `proof_code`.
Artifact schemas:
- `work/normalized_manifest.json` must be one JSON object with keys `case_id`, `active_files`, `ignored_files`, `normalized_files`.
- `work/document_index.json` must be one JSON object with keys `case_id`, `documents`.
- Each `documents` entry must use keys `document_id`, `document_type`, `source_path`, `normalized_path`.
- Normalized invoice text files must use exactly this line format:
  - `invoice_id=...`
  - `vendor_id=...`
  - `vendor_name=...`
  - `po_id=...`
  - `gross_total_cents=...`
- `warnings_by_invoice` must include every real invoice ID. Use an empty array when an invoice has no warnings.
Workspace files available to the model
README_TASK.md
# Paperwork Workflow Case 04: Messy Intake Folder
Case ID: P3-WORK-04
You are auditing a messy synthetic intake folder. Some files are active sources, some are stale drafts, and one scan is not an invoice.
Create:
- `work/normalized_manifest.json`
- `work/document_index.json`
- `work/normalized/INV-9107.txt`
- `work/normalized/INV-9108.txt`
- `work/normalized/INV-9109.txt`
- `audit_result.json`
- `proof.txt`
`work/normalized_manifest.json` schema:
```json
{
"case_id": "P3-WORK-04",
"active_files": [],
"ignored_files": [],
"normalized_files": []
}
```
`active_files` should list the active final sources used for the audit. `ignored_files` should list stale drafts, old exports, duplicate vendor files, and non-invoice scans. `normalized_files` should list the normalized invoice text files you create.
`work/document_index.json` schema:
```json
{
"case_id": "P3-WORK-04",
"documents": [
{
"document_id": "",
"document_type": "",
"source_path": "",
"normalized_path": ""
}
]
}
```
Use `document_type` values `invoice` or `credit_note`. For ignored non-invoice documents, set `normalized_path` to an empty string.
Each normalized invoice file must use exactly these five lines:
```text
invoice_id=...
vendor_id=...
vendor_name=...
po_id=...
gross_total_cents=...
```
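The five-line format above can be produced mechanically. The following is a minimal sketch, not the benchmark's own tooling: the helper name `render_normalized_invoice` and the trailing newline are assumptions, and the sample values are the ones this run reports for INV-9107.

```python
# Field order copied from the five-line format in the README.
FIELD_ORDER = ("invoice_id", "vendor_id", "vendor_name", "po_id", "gross_total_cents")


def render_normalized_invoice(fields: dict) -> str:
    """Render one invoice as five `key=value` lines in the required order.

    Hypothetical helper. It rejects missing or unexpected keys so the file
    always has exactly the five lines. The trailing newline is an assumption.
    """
    missing = [key for key in FIELD_ORDER if key not in fields]
    extra = [key for key in fields if key not in FIELD_ORDER]
    if missing or extra:
        raise ValueError(f"missing={missing} extra={extra}")
    return "\n".join(f"{key}={fields[key]}" for key in FIELD_ORDER) + "\n"


sample = {
    "invoice_id": "INV-9107",
    "vendor_id": "V-321",
    "vendor_name": "Alpha Desk Systems",
    "po_id": "PO-7001",
    "gross_total_cents": 11900,
}
text = render_normalized_invoice(sample)
```

Fixing the order in a tuple, rather than relying on the dict's insertion order, keeps the output stable however the fields were collected.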
Use the final bank export, the active vendor master, and the final purchase order list:
- final bank export: `incoming/bank-may-final.csv`
- active vendor master: `incoming/vendor_master.csv`
- final purchase orders: `incoming/purchase_orders.csv`
Ignore:
- `incoming/bank_export_old.csv`
- `incoming/vendors copy.csv`
- `incoming/po-list-draft.csv`
- the non-invoice document scan
The scan files contain the source document text. If your tool returns OCR text for an image, use it as the scan reading and cross-check it against the CSV files.
`audit_result.json` must contain exactly these keys:
- case_id
- approved_invoice_ids
- review_invoice_ids
- reject_invoice_ids
- ignored_document_ids
- total_approved_gross_cents
- warnings_by_invoice
- evidence
- proof_code
`warnings_by_invoice` must include every real invoice ID. Use an empty array when an invoice has no warnings.
Allowed warning codes:
- payment_short
- missing_payment
- missing_po
- inactive_vendor
Rules:
- Approve only active-vendor invoices with matching PO and exact final-bank payment.
- Put invoices with short payment, missing payment, missing PO, or inactive vendor into review.
- Reject only when a document is clearly invalid as an invoice. Non-invoice documents belong in `ignored_document_ids`.
- `total_approved_gross_cents` is the sum of approved invoice gross totals only.
- `evidence` must list the active source files used for the audit, with paths relative to the workspace.
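The approval rules above can be read as a small decision procedure. The sketch below is one possible reading, not the benchmark's oracle. It assumes that a PO whose status is not `open` counts as `missing_po`, that bank amounts are compared as positive cents keyed by the `reference` column, and that warnings are returned in alphabetical order. The names `Invoice`, `warnings_for`, and `classify` are hypothetical. The values are copied from the workspace CSV files and the normalized invoices reproduced on this page.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    invoice_id: str
    vendor_id: str
    po_id: str
    gross_total_cents: int


def warnings_for(inv, vendor_status, po_status, paid_cents_by_ref):
    """Return sorted warning codes for one invoice (empty list means approvable)."""
    warnings = []
    if vendor_status.get(inv.vendor_id) != "active":
        warnings.append("inactive_vendor")
    # Assumption: a cancelled or unknown PO is reported as missing_po.
    if po_status.get(inv.po_id) != "open":
        warnings.append("missing_po")
    paid = paid_cents_by_ref.get(inv.invoice_id)
    if paid is None:
        warnings.append("missing_payment")
    elif paid < inv.gross_total_cents:
        warnings.append("payment_short")
    elif paid > inv.gross_total_cents:
        # No allowed warning code covers overpayment, so surface it loudly.
        raise ValueError(f"overpayment on {inv.invoice_id}")
    return sorted(warnings)


def classify(invoices, vendor_status, po_status, paid_cents_by_ref):
    """Split invoices into approved and review lists and collect warnings."""
    approved, review, by_invoice = [], [], {}
    for inv in invoices:
        codes = warnings_for(inv, vendor_status, po_status, paid_cents_by_ref)
        by_invoice[inv.invoice_id] = codes
        (review if codes else approved).append(inv.invoice_id)
    return approved, review, by_invoice


vendor_status = {"V-321": "active", "V-654": "active", "V-777": "inactive"}
po_status = {"PO-7001": "open", "PO-7002": "open", "PO-7003": "cancelled"}
# Absolute values of amount_cents from bank-may-final.csv, keyed by reference.
paid = {"INV-9107": 11900, "INV-9108": 23055, "INV-9099": 6400}
invoices = [
    Invoice("INV-9107", "V-321", "PO-7001", 11900),
    Invoice("INV-9108", "V-654", "PO-7002", 24855),
    Invoice("INV-9109", "V-777", "PO-7003", 8740),
]
approved, review, by_invoice = classify(invoices, vendor_status, po_status, paid)
```

Under these assumptions only INV-9107 is approved; INV-9108 is short by 1800 cents, and INV-9109 collects three warnings.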
Proof code formula:
`proof_code = total_approved_gross_cents + sum(numeric parts of all real invoice IDs) + 97 * total_warning_count`
`proof.txt` must contain only the proof code number and nothing else.
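As a worked check of the formula, the sketch below recomputes the proof code from the audit values reported for this run. It assumes that the "numeric part" of an invoice ID is the digit run after `INV-`. The function name `proof_code` is hypothetical.

```python
import re


def proof_code(total_approved_gross_cents, warnings_by_invoice):
    """Apply the README formula to one audit result.

    Assumption: the numeric part of an ID such as "INV-9107" is 9107.
    """
    id_sum = sum(int(re.search(r"\d+", inv_id).group()) for inv_id in warnings_by_invoice)
    warning_count = sum(len(codes) for codes in warnings_by_invoice.values())
    return total_approved_gross_cents + id_sum + 97 * warning_count


warnings_by_invoice = {
    "INV-9107": [],
    "INV-9108": ["payment_short"],
    "INV-9109": ["inactive_vendor", "missing_payment", "missing_po"],
}
# 11900 + (9107 + 9108 + 9109) + 97 * 4 = 11900 + 27324 + 388
code = proof_code(11900, warnings_by_invoice)
print(code)  # prints 39612
```

The result matches the `proof_code` that appears in both the model's answer and the expected output, so the formula does not depend on the order of the warning codes.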
bank-may-final.csv
date,description,amount_cents,reference
2026-05-03,ALPHA DESK SYSTEMS INV-9107,-11900,INV-9107
2026-05-04,BETA OFFICE SUPPLY INV-9108,-23055,INV-9108
2026-05-06,ORION LEGACY PARTS PAYMENT,-6400,INV-9099

bank_export_old.csv
date,description,amount_cents,reference
2026-04-28,OLD TEST EXPORT DO NOT USE,-24855,INV-9108
2026-04-29,OLD ORION TEST EXPORT,-8740,INV-9109

vendor_master.csv
vendor_id,vendor_name,tax_id,status
V-321,Alpha Desk Systems,TX-321,active
V-654,Beta Office Supply,TX-654,active
V-777,Orion Legacy Parts,TX-777,inactive

vendors copy.csv
vendor_id,vendor_name,tax_id,status
V-321,Alpha Desk Systems,TX-321,active
V-654,Beta Office Supply,TX-654,inactive
V-777,Orion Legacy Parts,TX-777,active

purchase_orders.csv
po_id,vendor_id,gross_limit_cents,status
PO-7001,V-321,12500,open
PO-7002,V-654,25000,open
PO-7003,V-777,9000,cancelled

po-list-draft.csv
po_id,vendor_id,gross_limit_cents,status
PO-7001,V-321,12500,open
PO-7002,V-654,25000,cancelled
PO-7003,V-777,9000,open

notes_from_ap.txt
Use bank-may-final.csv, not the old export.
Use vendor_master.csv, not the copy.
Use purchase_orders.csv, not the draft list.
The credit note scan is not an invoice for this audit.
INV-9109 should not be approved while the vendor is inactive and the listed PO is cancelled.

Input image

Agent actions
list_files -> read_file -> read_file -> read_file -> read_file -> mkdir -> write_file -> write_file -> write_file -> write_file -> write_file -> write_file -> write_file -> finish

Model answer
## audit_result.json
{
"case_id": "P3-WORK-04",
"approved_invoice_ids": ["INV-9107"],
"review_invoice_ids": ["INV-9108", "INV-9109"],
"reject_invoice_ids": [],
"ignored_document_ids": ["CN-7710"],
"total_approved_gross_cents": 11900,
"warnings_by_invoice": {
"INV-9107": [],
"INV-9108": ["payment_short"],
"INV-9109": ["inactive_vendor", "missing_po", "missing_payment"]
},
"evidence": [
"incoming/bank-may-final.csv",
"incoming/vendor_master.csv",
"incoming/purchase_orders.csv"
],
"proof_code": 39612
}
## proof.txt
39612
## work/document_index.json
{
"case_id": "P3-WORK-04",
"documents": [
{
"document_id": "INV-9107",
"document_type": "invoice",
"source_path": "incoming/scan_001.png",
"normalized_path": "work/normalized/INV-9107.txt"
},
{
"document_id": "INV-9108",
"document_type": "invoice",
"source_path": "incoming/scan_002.png",
"normalized_path": "work/normalized/INV-9108.txt"
},
{
"document_id": "INV-9109",
"document_type": "invoice",
"source_path": "incoming/scan_003.png",
"normalized_path": "work/normalized/INV-9109.txt"
},
{
"document_id": "CN-7710",
"document_type": "credit_note",
"source_path": "incoming/quote_or_credit_note.png",
"normalized_path": ""
}
]
}
## work/normalized/INV-9107.txt
invoice_id=INV-9107
vendor_id=V-321
vendor_name=Alpha Desk Systems
po_id=PO-7001
gross_total_cents=11900
## work/normalized/INV-9108.txt
invoice_id=INV-9108
vendor_id=V-654
vendor_name=Beta Office Supply
po_id=PO-7002
gross_total_cents=24855
## work/normalized/INV-9109.txt
invoice_id=INV-9109
vendor_id=V-777
vendor_name=Orion Legacy Parts
po_id=PO-7003
gross_total_cents=8740
## work/normalized_manifest.json
{
"case_id": "P3-WORK-04",
"active_files": [
"incoming/bank-may-final.csv",
"incoming/vendor_master.csv",
"incoming/purchase_orders.csv",
"incoming/scan_001.png",
"incoming/scan_002.png",
"incoming/scan_003.png"
],
"ignored_files": [
"incoming/bank_export_old.csv",
"incoming/vendors copy.csv",
"incoming/po-list-draft.csv",
"incoming/quote_or_credit_note.png",
"incoming/notes_from_ap.txt"
],
"normalized_files": [
"work/normalized/INV-9107.txt",
"work/normalized/INV-9108.txt",
"work/normalized/INV-9109.txt"
]
}
## verify_output.json
{
"ok": false,
"core_ok": true,
"visible_ok": true,
"failure_types": [
"manifest_error"
],
"checks": [
{
"check": "audit_result_exists",
"passed": true,
"message": "Expected final audit_result.json with exact keys."
},
{
"check": "visible_checks_pass",
"passed": true,
"message": "Expected required artifacts and unchanged incoming/ sources."
},
{
"check": "core_oracle_pass",
"passed": true,
"message": "Expected core audit fields to match the oracle."
},
{
"check": "hidden_oracle_pass",
"passed": false,
"message": "Expected exact audit, artifacts, proof.txt, and unchanged sources."
}
],
"audit_error": "",
"actual": {
"case_id": "P3-WORK-04",
"approved_invoice_ids": [
"INV-9107"
],
"review_invoice_ids": [
"INV-9108",
"INV-9109"
],
"reject_invoice_ids": [],
"ignored_document_ids": [
"CN-7710"
],
"total_approved_gross_cents": 11900,
"warnings_by_invoice": {
"INV-9107": [],
"INV-9108": [
"payment_short"
],
"INV-9109": [
"inactive_vendor",
"missing_po",
"missing_payment"
]
},
"evidence": [
"incoming/bank-may-final.csv",
"incoming/vendor_master.csv",
"incoming/purchase_orders.csv"
],
"proof_code": 39612
},
"expected": {
"case_id": "P3-WORK-04",
"approved_invoice_ids": [
"INV-9107"
],
"review_invoice_ids": [
"INV-9108",
"INV-9109"
],
"reject_invoice_ids": [],
"ignored_document_ids": [
"CN-7710"
],
"total_approved_gross_cents": 11900,
"warnings_by_invoice": {
"INV-9107": [],
"INV-9108": [
"payment_short"
],
"INV-9109": [
"inactive_vendor",
"missing_payment",
"missing_po"
]
},
"evidence": [
"incoming/scan_001.png",
"incoming/scan_002.png",
"incoming/scan_003.png",
"incoming/bank-may-final.csv",
"incoming/vendor_master.csv",
"incoming/purchase_orders.csv"
],
"proof_code": 39612
}
}
|
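To see where the `actual` and `expected` blocks of `verify_output.json` diverge, a field-by-field comparison helps. The sketch below is only a reading aid: the page does not state what the hidden check compares or why it reports `manifest_error`, and the helper `differing_fields` is hypothetical. Only the fields that matter for the comparison are copied here.

```python
def differing_fields(actual, expected, ignore_order=False):
    """List top-level keys whose values differ between two audit results.

    With ignore_order=True, lists are compared as sorted sequences, which
    separates pure ordering differences from real content differences.
    """
    def norm(value):
        if isinstance(value, dict):
            return {key: norm(item) for key, item in value.items()}
        if isinstance(value, list):
            items = [norm(item) for item in value]
            return sorted(items, key=repr) if ignore_order else items
        return value

    keys = list(expected) + [key for key in actual if key not in expected]
    return [key for key in keys if norm(actual.get(key)) != norm(expected.get(key))]


csv_sources = [
    "incoming/bank-may-final.csv",
    "incoming/vendor_master.csv",
    "incoming/purchase_orders.csv",
]
scans = ["incoming/scan_001.png", "incoming/scan_002.png", "incoming/scan_003.png"]

actual = {
    "warnings_by_invoice": {
        "INV-9107": [],
        "INV-9108": ["payment_short"],
        "INV-9109": ["inactive_vendor", "missing_po", "missing_payment"],
    },
    "evidence": csv_sources,
    "proof_code": 39612,
}
expected = {
    "warnings_by_invoice": {
        "INV-9107": [],
        "INV-9108": ["payment_short"],
        "INV-9109": ["inactive_vendor", "missing_payment", "missing_po"],
    },
    "evidence": scans + csv_sources,
    "proof_code": 39612,
}

exact = differing_fields(actual, expected)
unordered = differing_fields(actual, expected, ignore_order=True)
```

Here `exact` contains `warnings_by_invoice` and `evidence`, while `unordered` contains only `evidence`. The warning codes for INV-9109 therefore differ only in order, whereas the model's `evidence` list omits the three scan files that the expected output includes.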