A system without provenance is a system the operator audits twice
Walk into a plant where the operations layer has shipped and watch a buyer review a drafted PO for the first time. One buyer approves in three seconds. Another flips to a separate window to verify. The difference between them is whether they can check a value without leaving the screen. The first buyer can click a cell, see the source PDF region the value was extracted from, and move on. The second buyer cannot, so they go to the inbox, find the email, open the PDF, scroll to the right page, and read the line themselves.
The two buyers are reviewing the same drafted PO. The system gave them the same data. What is different is whether the provenance is at the cell or at the bottom of the screen. The buyer with cell-level provenance trusts the system. The buyer without it audits it. Audit-twice is the failure mode that kills adoption of automated drafting, and it follows directly from where the provenance lives in the data model.
This piece argues that provenance has to be a primitive of the cell, not a query you run later. The shape of the argument is structural: where provenance lives in the data model determines whether the operator trusts the surface or not. Get this wrong and the prettiest UI in the world will not save the deployment.
- → Quote PDF arrives · 14:08 · page 3, region x=240,y=440
- → Extract + anchor · 14:08 · unit price 52.93 + region bounds
- → Cell render · 14:08 · value · provenance dot
- ← Click cell · 14:09 · open source pane
- ← Show region · 14:09 · PDF page 3, highlighted
- ← Verify or override · 14:09 · two-second decision
Figures illustrative · drawn from observed pattern, not a named deployment
The provenance lookup loop, end to end. A source artifact lands. The parser extracts a value and anchors the region. The cell write carries the value and the source reference together. The UI renders the cell with a click-through to the highlighted source. The operator verifies in two seconds.
Observed in two mid-market deployments where the same drafted-PO surface shipped first with footer-level provenance, then was retrofitted to cell-level provenance. Adoption numbers are within-deployment before and after the retrofit, not cross-customer.
What footer-level provenance gets wrong
The most common shape we see when a team adds an "audit trail" to their product is a footer panel. The cell shows the value. At the bottom of the form, there is a collapsible section called "Provenance" or "Sources" that lists the source artifacts that contributed to the form. The list shows the email subject line, the PDF filename, the parser run timestamp, and maybe a link to the raw artifact.
The footer panel is technically correct. The provenance is there. The data model knows it. The auditor, if they ever asked, could find it. What the footer panel does not do is make the provenance useful in the moment the operator is deciding whether to trust the cell.
The structural reason is that the operator has to make a decision per cell and the footer panel is per form. The operator looks at the unit price, wonders if that is the right unit price for the right SKU, and has to do a manual join: scroll to the footer, find the source PDF, open it, navigate to the right page, find the right row, verify. That join is roughly forty seconds per cell, and there are eleven cells on a typical drafted PO. That is more than seven minutes per PO. With forty POs in a buyer's morning queue, it comes to nearly five hours of footer-walking per day. The buyer stops doing it. They start either approving without checking or escalating to the senior buyer for spot checks. Both are failure modes.
I can see where the data came from. I have to leave the screen I am on to check it. So I do not check it. I approve the ones that look right and flag the ones that do not. The flagging is the same thing I was doing before the layer.
What cell-level provenance changes
The shape that works is cell-level provenance with two-click verification. Every cell on the surface carries a small indicator (a dot, a chip, or a similar small mark, subtle enough to be unobtrusive when the operator is reading and visible enough to be findable when they want it). Clicking the indicator opens a side panel that shows the exact source region the value was extracted from, highlighted in context. The buyer reads the line on the source PDF and either accepts the value with a click on the cell or overrides it with a typed value.
The change of shape is small in implementation and large in operator experience. The footer panel is replaced with cell decorations. The collapsible "Sources" section goes away. The data that backs the cell decorations is the same data that backed the footer panel. What changes is where the operator interacts with it.
The two-click time per cell is the load-bearing metric. If the operator can verify a cell in two clicks and two seconds, they will do it. If verification costs forty seconds, they will not. The cell-level architecture is the one that fits inside two seconds.
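The gap between the two verification costs can be made concrete with a few lines of arithmetic. This is a minimal sketch that reuses the figures quoted in this piece (two seconds versus forty seconds per cell, eleven cells per PO, forty POs in a morning queue); the function name is hypothetical.

```python
def daily_verify_seconds(seconds_per_cell, cells_per_po=11, pos_per_day=40):
    """Total seconds a buyer spends verifying if every cell is checked."""
    return seconds_per_cell * cells_per_po * pos_per_day


cell_level = daily_verify_seconds(2)     # click-through at the cell
footer_level = daily_verify_seconds(40)  # manual join via the footer panel

# How many times more expensive the footer walk is for the same queue.
ratio = footer_level / cell_level

print(f"cell-level: {cell_level / 60:.1f} min per day, footer is {ratio:.0f}x that")
```

At two seconds per cell, checking every cell in the whole queue stays under fifteen minutes, which is why operators keep doing it.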
What the form shows
- ▭ PO line 1 · PMR-2810-1500-S · 280 · $52.93
- ▭ PO line 2 · PMR-3104-CL-A · 560 · $11.40
- ▭ Vendor · V-218
- ▭ Req date · 2026-06-12
- ✎ Footer panel · Sources: 3 artifacts · click to expand
- ✎ Verify cost · ~40s per cell
What the cell carries
- Item: PMR-2810-1500-S (src · quote.pdf p2 r4)
- Qty: 280 (src · email body line 4)
- Unit price: 52.93 (src · quote.pdf p3 col B)
- Vendor: V-218 (src · email sender domain)
- Req date: 2026-06-12 (src · purchase req PR-9928)
- Click target: highlighted region · 2s
- Override target: inline typed value
- Audit trail: same artifacts, surfaced inline
The underlying data is the same in both cases. The architectural change is that the provenance reference is part of the cell write, not a separate document on the side.
Footer-level vs. cell-level provenance on the same drafted PO. The data model knows the same things in both cases. What changes is whether the operator can act on the provenance in the moment of decision or has to do a manual join to a separate section of the screen.
- Situation: Inbound customer RFQs arrived as a one-line email with a PDF drawing attached. Estimating, engineering, and purchasing each opened the drawing separately, often two business days apart.
- What was breaking: The average quote-to-PO loop was four business days. One-off BOMs were rebuilt from scratch every time because nothing in the ERP keyed off a customer drawing. Half of quotes landed outside the +/-8% margin band the operations VP enforced.
The five places provenance has to live for it to work
Cell-level provenance is not free. The data model has to carry the provenance reference in five specific places. Skip any one of them and the operator finds the gap on day three of the rollout.
Provenance is a column on the cell, not a join to a side table
Every cell write carries the source reference inline. The provenance is a structured object: source artifact id, source region (page, bounding box, line number, whatever the artifact format supports), parser version, confidence score. The provenance is written at the same time as the value. Looking up the provenance is a column read on the same row, not a join to a separate audit table.
The reason this matters: a join to a separate audit table is one more place that can be out of sync with the cell. If the value updates and the join misses, the provenance is wrong, and the operator notices because the highlighted region does not match the displayed value. Inlining the provenance with the value makes them atomic. They cannot drift.
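A minimal sketch of what "a column on the cell" can look like, using an in-memory SQLite table. The table name `po_cell`, the column names, the PO id `PO-1`, and the 0.97 confidence are all assumptions made for illustration; only the unit price, page, region, and parser version come from the examples in this piece.

```python
import json
import sqlite3

# Value and provenance share one row, so there is no side table to drift.
SCHEMA = """
CREATE TABLE po_cell (
    po_id          TEXT NOT NULL,
    field          TEXT NOT NULL,
    value          TEXT NOT NULL,
    src_artifact   TEXT NOT NULL,
    src_region     TEXT NOT NULL,
    parser_version TEXT NOT NULL,
    confidence     REAL NOT NULL CHECK (confidence BETWEEN 0 AND 1),
    PRIMARY KEY (po_id, field)
)
"""


def write_cell(db, po_id, field, value, artifact, region, parser_version, confidence):
    """Write value and provenance in a single statement inside one transaction."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO po_cell VALUES (?, ?, ?, ?, ?, ?, ?)",
            (po_id, field, value, artifact, json.dumps(region), parser_version, confidence),
        )


def read_cell(db, po_id, field):
    """Read value and provenance from the same row: a column read, not a join."""
    row = db.execute(
        "SELECT value, src_artifact, src_region, parser_version, confidence "
        "FROM po_cell WHERE po_id = ? AND field = ?",
        (po_id, field),
    ).fetchone()
    if row is None:
        return None
    value, artifact, region, parser_version, confidence = row
    return {
        "value": value,
        "artifact": artifact,
        "region": json.loads(region),
        "parser_version": parser_version,
        "confidence": confidence,
    }


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
write_cell(db, "PO-1", "unit_price", "52.93", "quote.pdf",
           {"page": 3, "x": 240, "y": 440}, "1.4", 0.97)
cell = read_cell(db, "PO-1", "unit_price")
print(cell["value"], cell["region"])
```

Because the `NOT NULL` constraints sit on the provenance columns, a value cannot be written without its source reference in this sketch.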
Source artifacts are retained verbatim and content-addressable
The cell's source reference points to a stored copy of the source artifact, content-addressed by hash so the same artifact is the same row even if it was re-uploaded. The stored copy is the verbatim original (the actual PDF bytes, the actual email .eml). Not a re-rendered preview. Not a parsed extract. The operator clicking through has to land on the exact thing the parser was looking at, so the verification is meaningful.
Re-rendered previews introduce their own bug class. The renderer might display the PDF differently from how the parser parsed it. The operator looks at the preview, sees the expected text, and approves. Six weeks later the auditor opens the original PDF and finds different text. Verbatim retention is the only shape that prevents this.
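A content-addressed store can be sketched in a few lines. This is an in-memory illustration, not a production store; the choice of SHA-256 as the hash and the class name `ArtifactStore` are assumptions, and the sample bytes are made up.

```python
import hashlib


class ArtifactStore:
    """Keeps the exact bytes the parser saw, keyed by their SHA-256 digest."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Re-uploading identical bytes resolves to the same stored object.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        data = self._blobs[digest]
        # Re-hash on read so a corrupted blob is never shown as the original.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("stored artifact no longer matches its address")
        return data


store = ArtifactStore()
first = store.put(b"%PDF-1.7 example quote bytes")
again = store.put(b"%PDF-1.7 example quote bytes")
print(first == again)
```

Any preview shown in the UI would be rendered from `store.get(...)` on demand, so the bytes the operator verifies against are the bytes the parser read.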
The parser version is recorded with the cell, not assumed from runtime
When parser version 1.4 extracted the value, the cell carries "parser 1.4." If the parser is upgraded to 1.5, old cells still reference 1.4. The operator investigating an old discrepancy can reproduce the extraction with the exact parser that produced it. This turns out to matter more than expected, because parser upgrades change extraction behaviour on edge cases, and the discrepancy walk needs to know which extraction algorithm produced the cell in question.[1]
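One way to make the recorded version actionable is a registry that maps version strings to extraction functions, so an old cell can be re-run against the code that produced it. The sketch below is illustrative only: the two toy parsers, their behaviour, and the sample line are invented to show how an upgrade can change the result on an edge case.

```python
import re
from typing import Callable, Dict

# Hypothetical registry: version string -> extraction function.
PARSERS: Dict[str, Callable[[str], str]] = {}


def register(version: str):
    def wrap(fn):
        PARSERS[version] = fn
        return fn
    return wrap


@register("1.4")
def price_v14(line: str) -> str:
    # Toy behaviour: take the first two-decimal number on the line.
    return re.findall(r"\d+\.\d{2}", line)[0]


@register("1.5")
def price_v15(line: str) -> str:
    # Toy behaviour after an upgrade: take the last two-decimal number.
    return re.findall(r"\d+\.\d{2}", line)[-1]


def reproduce(cell: dict, source_line: str) -> str:
    """Re-run the extraction with the parser version stored on the cell."""
    return PARSERS[cell["parser_version"]](source_line)


line = "Unit price 52.93 (was 54.10)"
old_cell = {"value": "52.93", "parser_version": "1.4"}
print(reproduce(old_cell, line), PARSERS["1.5"](line))
```

Here the stored version reproduces the original value, while the newer parser would have produced a different one from the same source line.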
Confidence is a first-class field, not a hidden tunable
The parser knows when it is uncertain. The cell carries that uncertainty as a numeric confidence score (0 to 1) and the UI renders cells with low confidence differently from cells with high confidence. The buyer's eye is drawn to the low-confidence cells. The high-confidence cells recede into a backdrop the buyer trusts by default. The two-second decision is a fast read of the low-confidence subset.
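A small sketch of how the confidence field can drive rendering. The 0.90 threshold, the style names, and the per-cell confidence values are assumptions for illustration; a real deployment would tune the cut-off.

```python
from typing import List, NamedTuple


class Cell(NamedTuple):
    field: str
    value: str
    confidence: float  # 0 to 1, as reported by the parser


REVIEW_THRESHOLD = 0.90  # assumed cut-off, not a recommended value


def render_style(cell: Cell, threshold: float = REVIEW_THRESHOLD) -> str:
    """Low-confidence cells are drawn prominently; the rest recede."""
    return "prominent" if cell.confidence < threshold else "subtle"


def review_queue(cells: List[Cell], threshold: float = REVIEW_THRESHOLD) -> List[Cell]:
    """The subset the buyer's eye should land on, least certain first."""
    flagged = [c for c in cells if c.confidence < threshold]
    return sorted(flagged, key=lambda c: c.confidence)


cells = [
    Cell("item", "PMR-2810-1500-S", 0.99),
    Cell("qty", "280", 0.72),
    Cell("unit_price", "52.93", 0.95),
    Cell("vendor", "V-218", 0.88),
    Cell("req_date", "2026-06-12", 0.98),
]
flagged_fields = [c.field for c in review_queue(cells)]
print(flagged_fields)
```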
Operator overrides leave a permanent record alongside the original value
When the buyer overrides a parsed value with a typed value, the original parsed value is not deleted. The cell becomes a small two-row history: the parsed value with its source and confidence, plus the override with the operator identity, the timestamp, and an optional reason. The override is the current cell value. The parsed value is there for the auditor to see what the parser thought and what the operator changed it to.
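The two-row history can be modelled as a parsed row that is never mutated plus an appended override. This is a sketch under assumptions: the class names, the operator id, the override value, and the reason text are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class ParsedValue:
    value: str
    source: str        # reference to the source region
    confidence: float


@dataclass(frozen=True)
class Override:
    value: str
    operator: str
    at: datetime
    reason: Optional[str] = None


@dataclass
class Cell:
    parsed: ParsedValue
    overrides: List[Override] = field(default_factory=list)

    @property
    def current(self) -> str:
        """The latest override wins; otherwise the parsed value stands."""
        return self.overrides[-1].value if self.overrides else self.parsed.value

    def override(self, value: str, operator: str, reason: Optional[str] = None) -> None:
        # Append only: the parsed row stays available to the auditor.
        self.overrides.append(
            Override(value, operator, datetime.now(timezone.utc), reason)
        )


cell = Cell(ParsedValue("52.93", "quote.pdf p3 col B", 0.95))
cell.override("53.10", operator="buyer-17", reason="vendor confirmed by phone")
print(cell.current, cell.parsed.value)
```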
What the click-through experience looks like
The mechanic of the click-through is mundane but worth describing precisely because most teams get it wrong on the first try. The cell has a small visual indicator. The buyer clicks it. A side panel slides in from the right (or a modal opens, depending on screen real estate) showing the source artifact at the relevant region, with the extracted value highlighted in place. The buyer reads the highlighted region. They close the panel. They are back on the form.
Three details matter. First, the click target is the indicator, not the cell itself, because clicking the cell has to remain reserved for editing the value. Second, the panel opens on the same artifact view the buyer expects (a PDF renders as a PDF; an email renders as an email), because translation between formats introduces its own confusion. Third, the panel closes with a single click or keystroke, because anything slower breaks the two-second rule.
The implementation is not exotic. Most modern PDF renderers support region highlighting. Email rendering is a solved problem. The work is in stitching the cell's source reference to the renderer in a way that consistently lands on the right region.
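Much of the stitching comes down to coordinate conversion. The sketch below assumes the stored region is a bounding box in PDF points with the origin at the bottom-left of the page (the PDF default user space), and that the viewer draws overlays in pixels from the top-left. It ignores page rotation and crop-box offsets, which a real renderer would need to handle. The bounding box and page height are made-up values.

```python
from typing import NamedTuple, Tuple


class HighlightRect(NamedTuple):
    left: float
    top: float
    width: float
    height: float


def region_to_highlight(
    bbox: Tuple[float, float, float, float],
    page_height_pt: float,
    scale: float,
) -> HighlightRect:
    """Map a PDF-space box (x0, y0, x1, y1) to a top-left-origin overlay rect.

    scale is viewer pixels per PDF point.
    """
    x0, y0, x1, y1 = bbox
    return HighlightRect(
        left=x0 * scale,
        # Flip the y axis: the top edge of the box is y1 in PDF space.
        top=(page_height_pt - y1) * scale,
        width=(x1 - x0) * scale,
        height=(y1 - y0) * scale,
    )


# A US Letter page is 792 points tall; the viewer here renders at 2 px per point.
rect = region_to_highlight((240.0, 440.0, 300.0, 452.0), page_height_pt=792.0, scale=2.0)
print(rect)
```

Getting the axis flip wrong is a common way for the highlight to land on the wrong row, which is exactly the mismatch operators notice first.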
The four specific changes Polymr makes
When a deployment moves from footer-level to cell-level provenance, four concrete things change in the data model and the surface. None of them are large engineering projects. All of them are missing from the typical first version of an operations layer.
- 1. Every cell write is a value + source reference + parser version + confidence. The data model treats provenance as a property of the cell, not a separate audit object. The cell cannot exist without its source reference. A cell with no provenance is a typed override, which has its own provenance (the operator who typed it).
- 2. Source artifacts are stored verbatim and content-addressed. The artifact store keeps the bytes the parser parsed, addressable by content hash. Re-uploads of the same artifact resolve to the same stored object. Renders for the UI are produced from the stored bytes on demand and never substituted for the original.
- 3. The UI renders provenance inline at the cell, with two-click verification. Each cell carries a small click target. Clicking opens the source artifact at the highlighted region. The operator can verify or override in two clicks total. The footer panel is removed because it is no longer needed.
- 4. Overrides leave a permanent two-row history at the cell. The parsed value persists alongside the override. The cell's current value is the override. The parsed value is available to the auditor and to the parser-training feedback loop without reconstructing it from logs.
- Source artifact: PDF · email · CAD
- Artifact store: content-addressed
- Parser: extract + region
- Cell write: value + source
- Operator UI: click-through
- Override: appended row
- Audit query: cell-level
The provenance pipeline. Source artifacts land and are stored verbatim. The parser extracts values and writes them as cells that carry their source reference inline. The UI renders cells with click-through. Operator overrides append to the cell, not replace it. The audit trail is the same data the operator interacts with.
What this looks like in week one
The first deployment is narrow. Pick one workflow where the operator currently double-checks every drafted record (most plants will hand you a list). Ship the cell-level provenance surface for that workflow with the verbatim artifact store behind it. Run the workflow alongside the existing footer-level surface for a week. Measure the operator's verify time per cell, the drop-to-inbox rate, and the approval throughput. The numbers usually move enough in the first week to justify the rollout to the rest of the workflows.
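The three week-one measurements can be computed from a simple UI event log. The event shape below (`verify` events with a duration, `po_reviewed` events with two flags) is a hypothetical schema invented for this sketch, and the sample events are made up.

```python
from statistics import median


def week_one_metrics(events, hours_observed):
    """Summarise verify time, drop-to-inbox rate, and approval throughput."""
    verify_times = [e["seconds"] for e in events if e["type"] == "verify"]
    reviews = [e for e in events if e["type"] == "po_reviewed"]
    dropped = sum(1 for e in reviews if e["dropped_to_inbox"])
    approved = sum(1 for e in reviews if e["approved"])
    return {
        # Median rather than mean, so one long lookup does not dominate.
        "median_verify_seconds": median(verify_times) if verify_times else None,
        "drop_to_inbox_rate": dropped / len(reviews) if reviews else None,
        "approvals_per_hour": approved / hours_observed,
    }


events = [
    {"type": "verify", "seconds": 2.0},
    {"type": "verify", "seconds": 3.0},
    {"type": "verify", "seconds": 41.0},
    {"type": "po_reviewed", "dropped_to_inbox": False, "approved": True},
    {"type": "po_reviewed", "dropped_to_inbox": False, "approved": True},
    {"type": "po_reviewed", "dropped_to_inbox": True, "approved": True},
    {"type": "po_reviewed", "dropped_to_inbox": False, "approved": False},
]
metrics = week_one_metrics(events, hours_observed=0.5)
print(metrics)
```

Running the same summary over the footer-level surface and the cell-level surface for the same week gives a within-deployment comparison.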
By week three, the operator has stopped opening the source email to verify routine cells. By week five, the operator has enabled auto-approval for the cells they trust without looking. The auto-approval is the leading indicator that the provenance surface earned its trust. Until that happens, the surface is approved-by-default-but-checked. That is the failure shape to monitor for.
Provenance is not a backend feature. It is the load-bearing surface that decides whether the operator trusts the drafting. Put it at the cell. Make it two clicks. Store the source verbatim. Everything else in the operations layer stacks on top of that.
The objections we hear most often
Three objections come up when we walk a team through this architecture. None of them are wrong. All of them have specific answers.
The first objection is the storage-cost argument. "If you store every source artifact verbatim, the storage bill is going to be enormous." The storage footprint for buyer-side artifacts (quote PDFs, email bodies, drawing PDFs) is typically in the tens of gigabytes per year for a mid-market plant, which is cheap to hold. The bill only becomes meaningful at the hundreds-of-thousands-of-records scale, and even then it is comfortably absorbed by object storage at today's prices. The compliance value of verbatim retention is orders of magnitude higher than the storage line item.
The second objection is the UI-complexity argument. "If every cell carries a click target, the screen is going to look noisy." The click target is intentionally small (a dot or a similar small mark) and recedes when the operator is not looking for it. The pattern that works is provenance as a subtle decoration on every cell, with low-confidence cells drawn more prominently. The operator's eye lands on the cells that need attention. The high-confidence cells are part of the background.
The third objection is the parser-trust argument. "If we make provenance this prominent, the operator will not trust the parser." This is the inversion of the failure mode and it does not happen in practice. Operators who can verify in two seconds trust the parser more, not less, because they have a fast feedback loop on its accuracy. Operators who cannot verify trust the parser less, because they assume the worst about anything they cannot check. The visible provenance is the trust-builder, not the trust-destroyer.
Footnotes
- [1] The parser-version field comes up most often during discrepancy investigations triggered by a vendor dispute. The dispute lands months after the original PO. The parser has been upgraded twice in the interim. Without the parser-version field on the cell, reproducing the original extraction requires either pinning the old parser binary or guessing. With the parser-version field, the cell carries exactly which extraction code produced it.
