gemma-4-26b-a4b

2026-05-15T08-13-16Z_google_gemma-4-26b-a4b_paperwork_workflow_case05

50% Practical score
0/1 Resolved tasks
1/1 Core tasks

Verdict

practical run result
near miss

Core 1/1 · Visible 1/1 · Hidden 0/1

final_document_set_error, missing_or_wrong_evidence, proof_code_error, proof_txt_error
Prompt set v3_paperwork_workflow_case05.json
Run type local
Temperature 0
Top P 1
Missed tasks 1/1

Task Results

Task email_attachment_intake_case_05
Category paperwork
Score 3/4
Failed checks hidden_oracle_pass
Failure types final_document_set_error, missing_or_wrong_evidence, proof_code_error, proof_txt_error

Prompt

Audit the synthetic email attachment intake folder in `incoming/`.

Write these files:
- `work/attachment_index.json`
- `work/final_document_set.json`
- `work/normalized/INV-2204-R1.txt`
- `audit_result.json`
- `proof.txt`

Use `incoming/README_TASK.md` as the full task specification. Use the generated image attachments, the email thread, `payment_export.csv`, `vendor_master.csv`, and `purchase_orders.csv`.

Important:
- Identify the final revised invoice, not just the first invoice-like image.
- The payment may use an old reference if the amount matches the revised invoice.
- Do not modify anything under `incoming/`.
- This is benchmark bookkeeping, not financial, tax, or legal advice.
- `proof.txt` must contain only the numeric proof code.

Workspace files available to the model

README_TASK.md
# Paperwork Workflow Case 05: Email Attachment Intake

Case ID: P3-WORK-05

You are auditing a synthetic email attachment intake folder. The folder contains generated document images, a thread note, payment data, vendor data, and a purchase order list.

Create these final artifacts:
- `work/attachment_index.json`
- `work/final_document_set.json`
- `work/normalized/INV-2204-R1.txt`
- `audit_result.json`
- `proof.txt`

Rules:
- Identify the active final document set.
- Use the email thread and the chat screenshot as evidence.
- Ignore superseded invoices, proforma documents, and hint screenshots as payment documents.
- Do not modify anything in `incoming/`.
- Return no advice. This is benchmark bookkeeping, not financial, tax, or legal advice.
- `proof.txt` must contain only the final numeric `proof_code`.

Final document logic:
- The first `INV-2204` attachment is superseded because the thread and chat hint say it had the wrong VAT.
- The revised attachment from May 8 is the final invoice: `INV-2204-R1`.
- `PF-2205` is a proforma invoice and is not a payment invoice.
- The bank payment may still use the old reference `INV-2204`; map it to `INV-2204-R1` only when the revised gross amount matches.

Use these document IDs for the attachment index and ignored-document lists:
- old invoice image: `INV-2204`
- revised invoice image: `INV-2204-R1`
- proforma image: `PF-2205`
- chat screenshot: `CHAT-MAY-08`

`work/attachment_index.json` schema:

```json
{
  "case_id": "P3-WORK-05",
  "attachments": [
    {
      "attachment_path": "",
      "document_id": "",
      "document_type": "",
      "decision": ""
    }
  ]
}
```

Allowed `document_type` values:
- `invoice`
- `proforma`
- `chat_hint`

Allowed `decision` values:
- `superseded`
- `final`
- `ignored`
- `evidence_only`

`work/final_document_set.json` schema:

```json
{
  "case_id": "P3-WORK-05",
  "final_invoice_ids": [],
  "superseded_invoice_ids": [],
  "ignored_document_ids": [],
  "payment_mapped_from": "",
  "payment_mapped_to": ""
}
```

The normalized invoice file must use exactly these eight lines:

```text
invoice_id=...
replaces_invoice_id=...
vendor_id=...
vendor_name=...
po_id=...
gross_total_cents=...
payment_reference=...
payment_match=...
```

`payment_match` must be exactly `true` or `false`.

`audit_result.json` must contain exactly these keys:
- case_id
- approved_invoice_ids
- review_invoice_ids
- reject_invoice_ids
- ignored_document_ids
- total_approved_gross_cents
- warnings_by_invoice
- evidence
- proof_code

`warnings_by_invoice` must include every final real invoice ID. Use an empty array when an invoice has no warnings.

Allowed warning codes:
- payment_missing
- payment_amount_mismatch
- inactive_vendor
- missing_po
- superseded_invoice
- non_payment_document

Approval rules:
- Approve only final invoices from active vendors with an open matching PO and an exact payment match.
- Put final invoices with missing payment, amount mismatch, missing PO, or inactive vendor into review.
- Do not approve superseded invoices or proforma documents.
- `ignored_document_ids` must include superseded invoices, proforma documents, and evidence-only screenshots.
- `total_approved_gross_cents` is the sum of approved final invoice gross totals only.
- `evidence` must list the files used to decide the final document set and payment mapping, with paths relative to the workspace.
- In `audit_result.json`, `evidence` should list only files that support the approved final invoice and payment mapping. Do not list `README_TASK.md` as evidence, and do not list every attachment just because it exists.

Proof code formula:

`proof_code = total_approved_gross_cents + numeric_token_for_final_invoice_ids + 97 * ignored_document_count + 503 * payment_revision_mapping_count`

For this case, the numeric token for `INV-2204-R1` is `22041`.

`payment_revision_mapping_count` is `1` when a payment with old reference `INV-2204` is correctly mapped to final invoice `INV-2204-R1`; otherwise it is `0`.
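
The formula can be checked with a short calculation. The sketch below is illustrative only: the function name `proof_code` and its parameter names are assumptions, and the inputs assume the outcome where `INV-2204-R1` is approved at 24990 cents, three documents are ignored (`INV-2204`, `PF-2205`, `CHAT-MAY-08`), and the old payment reference is mapped once.

```python
def proof_code(total_approved_gross_cents, invoice_token,
               ignored_document_count, payment_revision_mapping_count):
    """Evaluate the proof code formula exactly as written in the task text."""
    return (
        total_approved_gross_cents
        + invoice_token
        + 97 * ignored_document_count
        + 503 * payment_revision_mapping_count
    )


# Assumed inputs: 24990 approved cents, token 22041, 3 ignored documents,
# 1 payment revision mapping.
print(proof_code(24990, 22041, 3, 1))  # 47825
```

Term by term this is 24990 + 22041 + 291 + 503 = 47825.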
email_thread.txt
Subject: Harbor Office Supply attachment cleanup

2026-05-06 10:12 Mira:
The first Harbor Office Supply invoice was attached as INV-2204. Please hold it for now. The VAT looks wrong.

2026-05-08 09:07 Mira:
Please ignore the first INV-2204 attachment. Wrong VAT on that attachment.

2026-05-08 09:11 Jon:
Use the revised attachment from May 8. Revised one is INV-2204-R1.

2026-05-08 09:18 Jon:
The bank reference may still show the old invoice number, but the amount should match the revised total.

2026-05-08 09:30 AP intake:
The proforma document is for quote tracking only. It is not a payment invoice.
payment_export.csv
date,description,amount_cents,reference
2026-05-09,HARBOR OFFICE SUPPLY PAYMENT,-24990,INV-2204
2026-05-09,UNRELATED COFFEE SUPPLIES,-1840,RCPT-778
purchase_orders.csv
po_id,vendor_id,gross_limit_cents,status
PO-8801,V-410,26000,open
PO-8802,V-410,12000,draft
vendor_master.csv
vendor_id,vendor_name,tax_id,status
V-410,Harbor Office Supply,TX-410,active
V-411,Harbor Office Supply Old Record,TX-OLD,inactive
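
The README's approval rules and the old-reference payment mapping can be applied to the three CSV files above. The following is a minimal sketch, not the benchmark's own logic: `decide`, `read_rows`, and the `INVOICE` record are hypothetical, the invoice fields are assumed from the normalized file shown later in this report (the image contents are not reproduced here), and "matching PO" is assumed to mean same `po_id`, same `vendor_id`, and status `open`.

```python
import csv
import io

# Rows copied from the workspace CSV files.
PAYMENTS = """date,description,amount_cents,reference
2026-05-09,HARBOR OFFICE SUPPLY PAYMENT,-24990,INV-2204
2026-05-09,UNRELATED COFFEE SUPPLIES,-1840,RCPT-778
"""
PURCHASE_ORDERS = """po_id,vendor_id,gross_limit_cents,status
PO-8801,V-410,26000,open
PO-8802,V-410,12000,draft
"""
VENDORS = """vendor_id,vendor_name,tax_id,status
V-410,Harbor Office Supply,TX-410,active
V-411,Harbor Office Supply Old Record,TX-OLD,inactive
"""

# Hypothetical invoice record; values are assumed, not read from the image.
INVOICE = {
    "invoice_id": "INV-2204-R1",
    "replaces_invoice_id": "INV-2204",
    "vendor_id": "V-410",
    "po_id": "PO-8801",
    "gross_total_cents": 24990,
}


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def decide(invoice):
    """Return (decision, warnings) for one final invoice."""
    warnings = []

    vendors = {v["vendor_id"]: v for v in read_rows(VENDORS)}
    vendor = vendors.get(invoice["vendor_id"])
    if vendor is None or vendor["status"] != "active":
        warnings.append("inactive_vendor")

    has_open_po = any(
        po["po_id"] == invoice["po_id"]
        and po["vendor_id"] == invoice["vendor_id"]
        and po["status"] == "open"
        for po in read_rows(PURCHASE_ORDERS)
    )
    if not has_open_po:
        warnings.append("missing_po")

    # A payment may carry either the revised ID or the superseded one.
    accepted_refs = {invoice["invoice_id"], invoice["replaces_invoice_id"]}
    payments = [p for p in read_rows(PAYMENTS) if p["reference"] in accepted_refs]
    if not payments:
        warnings.append("payment_missing")
    elif all(-int(p["amount_cents"]) != invoice["gross_total_cents"] for p in payments):
        # Outgoing payments are negative in the export, so flip the sign.
        warnings.append("payment_amount_mismatch")

    return ("approve" if not warnings else "review"), warnings


print(decide(INVOICE))  # ('approve', [])
```

Under these assumptions the revised invoice is approved with no warnings, because the payment referenced as `INV-2204` matches the revised gross amount exactly.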

Input images

IMG_2048.png
chat_hint.png
doc-7.png
invoice_new.png

Agent actions

list_files -> read_file -> read_file -> read_file -> read_file -> read_file -> mkdir -> write_file -> write_file -> write_file -> write_file -> write_file -> run_checks -> finish

Model answer

## audit_result.json

{"case_id":"P3-WORK-05","approved_invoice_ids":["INV-2204-R1"],"review_invoice_ids":[],"reject_invoice_ids":[],"ignored_document_ids":["INV-2204","PF-2205","CHAT-MAY-08"],"total_approved_gross_cents":24990,"warnings_by_invoice":{"INV-2204-R1":[]},"evidence":["incoming/attachments/invoice_new.png","incoming/attachments/chat_hint.png","incoming/payment_export.csv"],"proof_code":25376}

## proof.txt

25376

## work/attachment_index.json

{"case_id":"P3-WORK-05","attachments":[{"attachment_path":"incoming/attachments/IMG_2048.png","document_id":"INV-2204","document_type":"invoice","decision":"superseded"},{"attachment_path":"incoming/attachments/chat_hint.png","document_id":"CHAT-MAY-08","document_type":"chat_hint","decision":"evidence_only"},{"attachment_path":"incoming/attachments/doc-7.png","document_id":"PF-2205","document_type":"proforma","decision":"ignored"},{"attachment_path":"incoming/attachments/invoice_new.png","document_id":"INV-2204-R1","document_type":"invoice","decision":"final"}]}

## work/final_document_set.json

{"case_id":"P3-WORK-05","final_invoice_ids":["INV-2204-R1"],"superseded_invoice_ids":["INV-2204"],"ignored_document_ids":["PF-2205","CHAT-MAY-08"],"payment_mapped_from":"INV-2204","payment_mapped_to":"INV-2204-R1"}

## work/normalized/INV-2204-R1.txt

invoice_id=INV-2204-R1
replaces_invoice_id=INV-2204
vendor_id=V-410
vendor_name=Harbor Office Supply
po_id=PO-8801
gross_total_cents=24990
payment_reference=INV-2204
payment_match=true
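
The README requires this file to use exactly eight `key=value` lines in a fixed order, with `payment_match` set to `true` or `false`. A minimal format check might look like the sketch below; `validate_normalized` is a hypothetical helper and is not the benchmark's verifier, whose exact checks are not shown in this report.

```python
EXPECTED_KEYS = [
    "invoice_id",
    "replaces_invoice_id",
    "vendor_id",
    "vendor_name",
    "po_id",
    "gross_total_cents",
    "payment_reference",
    "payment_match",
]


def validate_normalized(text):
    """Return True if text has the eight expected key=value lines in order."""
    lines = text.strip("\n").split("\n")
    if len(lines) != len(EXPECTED_KEYS):
        return False
    for line, expected_key in zip(lines, EXPECTED_KEYS):
        key, sep, value = line.partition("=")
        if sep != "=" or key != expected_key:
            return False
        if key == "payment_match" and value not in ("true", "false"):
            return False
    return True


# The model's normalized file, copied from this report.
MODEL_FILE = """invoice_id=INV-2204-R1
replaces_invoice_id=INV-2204
vendor_id=V-410
vendor_name=Harbor Office Supply
po_id=PO-8801
gross_total_cents=24990
payment_reference=INV-2204
payment_match=true
"""

print(validate_normalized(MODEL_FILE))  # True
```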

## verify_output.json

{
  "ok": false,
  "core_ok": true,
  "visible_ok": true,
  "failure_types": [
    "final_document_set_error",
    "missing_or_wrong_evidence",
    "proof_code_error",
    "proof_txt_error"
  ],
  "checks": [
    {
      "check": "audit_result_exists",
      "passed": true,
      "message": "Expected final audit_result.json with exact keys."
    },
    {
      "check": "visible_checks_pass",
      "passed": true,
      "message": "Expected required artifacts and unchanged incoming/ sources."
    },
    {
      "check": "core_oracle_pass",
      "passed": true,
      "message": "Expected core audit fields to match the oracle."
    },
    {
      "check": "hidden_oracle_pass",
      "passed": false,
      "message": "Expected exact audit, artifacts, proof.txt, and unchanged sources."
    }
  ],
  "audit_error": "",
  "actual": {
    "case_id": "P3-WORK-05",
    "approved_invoice_ids": [
      "INV-2204-R1"
    ],
    "review_invoice_ids": [],
    "reject_invoice_ids": [],
    "ignored_document_ids": [
      "INV-2204",
      "PF-2205",
      "CHAT-MAY-08"
    ],
    "total_approved_gross_cents": 24990,
    "warnings_by_invoice": {
      "INV-2204-R1": []
    },
    "evidence": [
      "incoming/attachments/invoice_new.png",
      "incoming/attachments/chat_hint.png",
      "incoming/payment_export.csv"
    ],
    "proof_code": 25376
  },
  "expected": {
    "case_id": "P3-WORK-05",
    "approved_invoice_ids": [
      "INV-2204-R1"
    ],
    "review_invoice_ids": [],
    "reject_invoice_ids": [],
    "ignored_document_ids": [
      "CHAT-MAY-08",
      "INV-2204",
      "PF-2205"
    ],
    "total_approved_gross_cents": 24990,
    "warnings_by_invoice": {
      "INV-2204-R1": []
    },
    "evidence": [
      "incoming/email_thread.txt",
      "incoming/attachments/chat_hint.png",
      "incoming/attachments/invoice_new.png",
      "incoming/payment_export.csv",
      "incoming/purchase_orders.csv",
      "incoming/vendor_master.csv"
    ],
    "proof_code": 47825
  }
}
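
The differences between `actual` and `expected` are easier to see with a field-by-field comparison. The sketch below is an illustration rather than the verifier's method: `diff_fields` is a hypothetical helper, only four of the audit fields are copied in, and list fields are compared as sets on the assumption that order does not matter. The `final_document_set_error` failure concerns a file whose expected content does not appear in `verify_output.json`, so it is not covered here.

```python
ACTUAL = {
    "ignored_document_ids": ["INV-2204", "PF-2205", "CHAT-MAY-08"],
    "total_approved_gross_cents": 24990,
    "evidence": [
        "incoming/attachments/invoice_new.png",
        "incoming/attachments/chat_hint.png",
        "incoming/payment_export.csv",
    ],
    "proof_code": 25376,
}
EXPECTED = {
    "ignored_document_ids": ["CHAT-MAY-08", "INV-2204", "PF-2205"],
    "total_approved_gross_cents": 24990,
    "evidence": [
        "incoming/email_thread.txt",
        "incoming/attachments/chat_hint.png",
        "incoming/attachments/invoice_new.png",
        "incoming/payment_export.csv",
        "incoming/purchase_orders.csv",
        "incoming/vendor_master.csv",
    ],
    "proof_code": 47825,
}


def diff_fields(actual, expected):
    """Report fields that differ; lists are compared without regard to order."""
    report = {}
    for key, want in expected.items():
        got = actual.get(key)
        if isinstance(want, list):
            missing = sorted(set(want) - set(got or []))
            extra = sorted(set(got or []) - set(want))
            if missing or extra:
                report[key] = {"missing": missing, "extra": extra}
        elif got != want:
            report[key] = {"actual": got, "expected": want}
    return report


for field, detail in diff_fields(ACTUAL, EXPECTED).items():
    print(field, detail)
```

Under the set-comparison assumption, two fields differ: `evidence` lacks `incoming/email_thread.txt`, `incoming/purchase_orders.csv`, and `incoming/vendor_master.csv`, and `proof_code` is 25376 where 47825 was expected.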