Most vendor scorecards do not survive scrutiny

Ask a mid-market manufacturer for their vendor scoring methodology. You will get one of three answers. The most common is a senior buyer who can list, from memory, the fifteen vendors they actively manage and rank them in their head, with reasons. The second most common is a spreadsheet maintained by procurement that has not been recomputed since the buyer rotated last year. The third is a tier-one ERP report that scores vendors on three fields the ERP captures natively (on-time delivery, quantity accuracy, defect rate) and ignores everything else.

None of the three survive a serious procurement review. The memory case fails because it cannot be audited. The spreadsheet case fails because the inputs are stale. The ERP-report case fails because the scoring is on the wrong fields. None of them encode the actual reasons the senior buyer prefers vendor V-218 over vendor V-244 for coil work in the third quarter, which are roughly: V-218 holds price better when the resin index moves, V-244 has a faster turnaround on the 1.2mm gauge specifically, and V-218's packaging is friendlier to the cell's incoming inspection process.

A scorecard that holds up to a procurement review has to do three things the existing approaches do not. Disclose its weights. Source-link every signal it scores on. And explicitly carry the overrides where the buyer's judgment differs from the score, with reasons.

[Diagram: the scoring pipeline]

Participants: ERP (PO + receipt history) · incoming inspection · Finance (payment terms) · Scoring layer (weighted aggregator) · Buyer (sourcing decision)

→ 24-month PO + receipt history · continuous · on-time, qty accuracy, lead time variance
→ Incoming inspection ledger · continuous · defect rate, NCR rate, rework $
→ Payment + dispute history · continuous · invoice disputes, credit memos
→ Composite score per vendor · on-demand · weights disclosed, source-linked
← Override with reason · as-needed · tribal knowledge captured

Figures illustrative · drawn from observed pattern, not a named deployment

The scoring pipeline. Signals fan in from ERP, QA, and finance. The composite score is computed continuously per vendor with the weights disclosed. The buyer can override with a recorded reason. The override stays in the audit record alongside the score.

What the gap costs

Sourcing decisions · 62% of award decisions made against a scorecard older than 90 days · stale by definition

Override rate · 34% of awards overrode the spreadsheet recommendation · reason recorded: 9%

Audit disclosure · 18% of vendors had composite weights the buyer could explain to a procurement reviewer in one sitting · rest required reconstruction

Sampled across two mid-market metal-fab plants over six months of sourcing decisions. The first two stats are the cost of bad scoring. The third is the disclosure gap when a procurement review lands.

Figures illustrative · drawn from observed pattern, not a named deployment

The three structural failure modes of a vendor scorecard

Before getting to what works, name the failure modes. A bad scorecard fails in one of three recognisable ways, every time.

Failure mode 1 · undisclosed weights

The score is a number with no audit trail back to inputs

The spreadsheet has a column called Composite and the formula bar shows =0.3*B2 + 0.25*C2 + 0.2*D2 + .... The weights are stored in the formula. They are not documented anywhere else. The buyer who built the spreadsheet knows that 0.3 was chosen because price stability was a board-level concern in the year the spreadsheet was created. The current buyer does not know that. The current buyer reads a composite of 6.4 and acts on it.

When the procurement reviewer asks why vendor V-244 scored lower than V-218, the answer requires walking the weights back from the formula. The walk takes hours, surfaces the fact that the weights were last touched two buyers ago, and leaves the reviewer with no confidence in the composite.

Failure mode 2 · stale signals

The inputs were last recomputed quarterly, by hand

The on-time delivery field on the scorecard was last updated when procurement ran the quarterly review. The current quarter is half over. V-244 had two late deliveries last month that the spreadsheet does not reflect. The buyer is working from a snapshot of vendor performance that is materially older than the decision they are about to make.

The structural reason this happens is that the signal aggregation is a human task. The ERP has the receipt history. The QA system has the inspection ledger. Finance has the payment history. Joining them into a scorecard requires three downloads, an Excel pivot, and a manual recomputation. Nobody runs that on a Tuesday because of one award decision.

Failure mode 3 · the soft field

The scorecard has a column called Relationship and nobody knows what is in it

The most damaging failure mode is the soft field. Most scorecards include a column called Relationship or Responsiveness or Strategic fit, scored 1-10 by the buyer. Nobody knows what 7 means. Nobody knows what would have to change for V-244 to move from 6 to 7. The soft field is where the tribal knowledge lives, and the soft field is unauditable.^[1]

The standard procurement-review move is to delete the soft field. That makes the scorecard auditable and worse, because the tribal knowledge it was capturing is now captured nowhere. The buyer goes back to overriding from memory and the scorecard becomes ceremonial.

The honest move

The soft field is the right idea executed wrong. Replace it with structured overrides. The override is the buyer saying "I am picking V-218 even though V-244 scored higher, because of X." The X is recorded. The override is auditable. The tribal knowledge is captured without pretending to be a number.

I do not need the scorecard to be right. I need to be able to explain why I picked the vendor I picked. The scorecard that prevents that conversation is worse than no scorecard.

Procurement director, mid-market metal-fab, post-review

What a scorecard that survives review looks like

The shape of a scorecard that survives a procurement review is not complicated. It has three structural properties. Disclose the weights, source-link the signals, and carry the overrides with reasons. Everything else is the specific metrics for the specific category. The structure is the same.

Property 1 · disclosed weights

Every weight is visible and changeable, with the change recorded

The weights are not in a formula. They are in a config that the buyer can read and the procurement reviewer can inspect. When the weights change, the change is recorded with a reason and an effective date. The score for a vendor on a given date is reproducible because the weights for that date are recoverable.

This sounds obvious until you go look at an in-production scorecard. The weights are almost always either hard-coded in a sheet formula or buried in a stored procedure nobody remembers writing. Surfacing them as a first-class config is the single highest-leverage change in the whole scorecard design.
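A minimal sketch of what a first-class weights config could look like, assuming weights are stored as dated versions with a reason attached. The names WeightVersion, HISTORY, weights_on, and composite are hypothetical, and the dates and earlier weights are invented for illustration. Only the newest weights (on-time 0.3, qa 0.3, price 0.4) echo the illustrative config quoted later in this article.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class WeightVersion:
    effective: date            # first date this version applies
    weights: dict[str, float]  # signal name -> weight, summing to 1.0
    reason: str                # why the weights were changed


# Hypothetical history for one category (all values are illustrative).
HISTORY = [
    WeightVersion(date(2025, 1, 1), {"on_time": 0.5, "qa": 0.5}, "initial config"),
    WeightVersion(
        date(2025, 7, 1),
        {"on_time": 0.3, "qa": 0.3, "price": 0.4},
        "price stability added as a scored signal",
    ),
]


def weights_on(day: date, history: list[WeightVersion]) -> WeightVersion:
    """Return the version in force on `day` (latest effective date <= day)."""
    applicable = [v for v in history if v.effective <= day]
    if not applicable:
        raise LookupError(f"no weights in force on {day}")
    return max(applicable, key=lambda v: v.effective)


def composite(signals: dict[str, float], day: date) -> float:
    """Weighted sum of 0-1 signal scores using the weights in force on `day`."""
    version = weights_on(day, HISTORY)
    return sum(w * signals[name] for name, w in version.weights.items())


signals = {"on_time": 0.88, "qa": 0.97, "price": 0.90}
score_march = composite(signals, date(2025, 3, 1))    # uses the first version
score_august = composite(signals, date(2025, 8, 1))   # uses the second version
```

The same inputs produce different composites on the two dates, and either one can be recomputed later because the version in force on each date is recoverable.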

Property 2 · source-linked signals

Every score has a click-through to the underlying ledger

The on-time delivery score for V-244 of 88% is not a standalone number. It is a click-through to the list of receipts that produced it: the 47 receipts in the trailing 90 days, the 6 that were late, the calendar days of lateness per receipt. The reviewer can verify the score by walking the underlying receipts. The buyer can dispute the score by flagging a specific receipt as out of scope (for example, a receipt that was late because the buyer rescheduled).

The source-link is what makes the score auditable. It is also what makes the score correctable. Without it, a wrong score is permanent. With it, the next time someone notices the score is wrong, they can trace the bad input and fix it.
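One way to picture a source-linked score: the score object carries the identifiers of the receipts it was computed from, so a reviewer can walk them and a buyer can flag one as out of scope. This is a sketch under assumed names (Receipt, LinkedScore, on_time_score) with invented receipt data, not a description of any particular system.

```python
from dataclasses import dataclass


@dataclass
class Receipt:
    receipt_id: str
    days_late: int              # 0 means on time
    out_of_scope: bool = False  # e.g. late because the buyer rescheduled


@dataclass
class LinkedScore:
    value: float            # fraction of in-scope receipts that were on time
    receipt_ids: list[str]  # every receipt the value was computed from
    late_ids: list[str]     # the subset that counted as late


def on_time_score(receipts: list[Receipt]) -> LinkedScore:
    in_scope = [r for r in receipts if not r.out_of_scope]
    if not in_scope:
        raise ValueError("no in-scope receipts to score")
    late = [r for r in in_scope if r.days_late > 0]
    return LinkedScore(
        value=1 - len(late) / len(in_scope),
        receipt_ids=[r.receipt_id for r in in_scope],
        late_ids=[r.receipt_id for r in late],
    )


# Hypothetical receipts for one vendor.
receipts = [
    Receipt("R-001", 0),
    Receipt("R-002", 3),
    Receipt("R-003", 0),
    Receipt("R-004", 5),
]
before = on_time_score(receipts)
receipts[3].out_of_scope = True  # buyer flags R-004 as a buyer-side reschedule
after = on_time_score(receipts)
```

Flagging a receipt changes the score and the list of inputs together, so the correction is visible rather than silent.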

Property 3 · overrides with reasons

Buyer overrides are first-class records, not deleted scores

When the buyer awards V-218 over the higher-scoring V-244, the override is recorded as a structured object. The fields are: which vendor was selected, which vendor the scorecard recommended, the reason from a closed list of categories (price stability, gauge availability, packaging compatibility, payment terms, relationship), and a free-text note. The score itself is not modified. The override sits alongside the score in the audit record.

The override pattern is what captures the tribal knowledge. Six months of overrides accumulate into a corpus that says "for coil work in Q3, V-218 is preferred for resin price stability eighty percent of the time." That corpus is now visible to the next buyer. It can also feed back into the scorecard design (perhaps price stability should be its own scored signal rather than appearing only in overrides).
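The override record described above can be sketched as a small immutable structure. The field names, the Reason enum, and the sample date are assumptions for illustration. The five reason categories are the ones listed in this section.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Reason(Enum):
    """Closed list of override reasons; anything else is rejected."""
    PRICE_STABILITY = "price stability"
    GAUGE_AVAILABILITY = "gauge availability"
    PACKAGING_COMPATIBILITY = "packaging compatibility"
    PAYMENT_TERMS = "payment terms"
    RELATIONSHIP = "relationship"


@dataclass(frozen=True)
class Override:
    decided_on: date
    selected_vendor: str
    recommended_vendor: str
    reason: Reason
    note: str = ""  # optional free text

    def __post_init__(self):
        # An award that matches the recommendation is not an override.
        if self.selected_vendor == self.recommended_vendor:
            raise ValueError("selection matches the scorecard recommendation")


# Hypothetical record; it is stored next to the score, which is left untouched.
record = Override(
    decided_on=date(2025, 8, 14),
    selected_vendor="V-218",
    recommended_vendor="V-244",
    reason=Reason.PRICE_STABILITY,
    note="holds price better when the resin index moves",
)
```

Because the reason is an enum rather than free text, six months of records can be grouped and counted without cleaning up spelling variants first.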

ERP vendor report

What the report knows

V-244 on-time: 88%
V-244 qty accuracy: 97%
V-244 defect rate: 1.2%
Weights: undocumented
Soft field: relationship: 7
Last update: Q1 2026

Survives-review scorecard

What an ops layer maintains

On-time: 88% · 47 receipts · 6 late
Qty accuracy: 97% · 47 receipts · 2 short
Defect rate: 1.2% · 3 NCRs · click-through
Price stability: +/- 2.8% vs index
Weights config: on-time 0.3 · qa 0.3 · price 0.4
Last computed: continuous · 12 min ago
Overrides: 2 in trailing 90 days, both for gauge
Audit trail: full · per-signal click-through

The composite is the same kind of number. The difference is the disclosure, the freshness, and the override capture.

Figures illustrative · drawn from observed pattern, not a named deployment

What an ERP-native vendor report contains vs. what a survives-review scorecard contains. The ERP report is correct as far as it goes. The scorecard adds disclosed weights, source-linked signals across three systems, and a structured override record that captures the buyer's judgment instead of erasing it.

Where this hurts most · metal fabrication

Estimating lead, custom industrial fabricator

Situation: Inbound customer RFQs arrived as a one-line email with a PDF drawing attached. Estimating, engineering, and purchasing each opened the drawing separately, often two business days apart.
What was breaking: The average quote-to-PO loop was four business days. One-off BOMs were rebuilt from scratch every time because nothing in the ERP keyed off a customer drawing. Half of quotes landed outside the +/-8% margin band the operations VP enforced.

BOM extraction + RFQ · Quote-to-procure · Engineering revisions

Outcome · 5 weeks

Quote-to-PO cycle: 0.6 days (was 4.1 days, −85%)

Illustrative, reflects this specific deployment. Outcomes vary by plant, stack, and scope.

Five signals worth scoring (and three that look good but are noise)

Vendor scoring is one of those problems where the right signals are unglamorous and the popular signals are misleading. A short list of what to actually score and what to drop.

Signal 1 · on-time delivery, by promise window

Score against the vendor's acknowledged date, not the originally requested date

On-time delivery is the obvious signal and the one most often mis-scored. The score should be on-time against the vendor's acknowledged date, not against the buyer's original requested date. Vendors who push back on requested dates and then hit their acknowledged dates are reliable. Vendors who accept every requested date and miss half of them are not. The ERP report usually scores against the requested date, which rewards the wrong behaviour.
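The difference between the two baselines is easy to see on a toy example. The sketch below assumes each PO line carries a requested, an acknowledged, and a received date. The PoLine type, the on_time_rate function, and both vendors' data are hypothetical.

```python
from datetime import date
from typing import NamedTuple


class PoLine(NamedTuple):
    requested: date
    acknowledged: date
    received: date


def on_time_rate(lines: list[PoLine], against: str) -> float:
    """Share of lines received on or before the chosen promise date.

    `against` is either "requested" or "acknowledged".
    """
    hits = sum(1 for ln in lines if ln.received <= getattr(ln, against))
    return hits / len(lines)


# Hypothetical vendor that pushes back on requested dates, then hits its own.
pushes_back = [
    PoLine(date(2025, 3, 3), date(2025, 3, 10), date(2025, 3, 9)),
    PoLine(date(2025, 3, 17), date(2025, 3, 21), date(2025, 3, 21)),
]
# Hypothetical vendor that accepts every requested date and misses half.
accepts_all = [
    PoLine(date(2025, 3, 3), date(2025, 3, 3), date(2025, 3, 3)),
    PoLine(date(2025, 3, 17), date(2025, 3, 17), date(2025, 3, 24)),
]
```

Scored against the requested date, the first vendor gets 0% and the second 50%. Scored against the acknowledged date, the first gets 100% and the second still 50%.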

Signal 2 · quantity accuracy, per line

Score short and over shipments at the line level, not the order level

A PO with five lines where one is short and one is over is not 100% accurate just because the totals net out. Score per line. Track the absolute deviation, not the signed one. Vendors who routinely send over-shipments to make up for past short-shipments are gaming the order-level score and costing you incoming-inspection time.
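A small worked example of the netting problem, assuming each PO line is an (ordered, received) pair. Both function names and the quantities are illustrative. Normalising by total ordered quantity is one reasonable choice among several.

```python
def order_level_accuracy(lines):
    """Accuracy from order totals only; over and short shipments cancel out."""
    ordered = sum(o for o, _ in lines)
    received = sum(r for _, r in lines)
    return 1 - abs(received - ordered) / ordered


def line_level_accuracy(lines):
    """1 minus total absolute per-line deviation over total ordered quantity."""
    ordered = sum(o for o, _ in lines)
    deviation = sum(abs(r - o) for o, r in lines)
    return 1 - deviation / ordered


# Hypothetical five-line PO: one line 10 short, one line 10 over.
po = [(100, 100), (100, 90), (100, 110), (100, 100), (100, 100)]

order_score = order_level_accuracy(po)  # totals net out, so this reads as perfect
line_score = line_level_accuracy(po)    # 20 units of deviation on 500 ordered
```

The order-level figure is 100% while the line-level figure is 96%, which is the gap the incoming-inspection team actually feels.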

Signal 3 · price stability against an index

Score how the vendor's price moves relative to a published commodity baseline

This is the signal most scorecards miss. Vendors who hold price within the band of the underlying commodity index are valuable in a way that does not show up in any standard scorecard field. Vendors who pass through every micro-move in the index are passing the index volatility to you. Vendors who pass the index through on the way up but lag it on the way down are net-extractive. None of this is visible without scoring against a baseline.
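One possible way to score against a baseline is to measure pass-through separately for periods where the index rose and periods where it fell. The sketch below assumes aligned, equally spaced price series for the index and the vendor. The function names and both series are invented for illustration, and other stability measures would be equally defensible.

```python
def pct_changes(series):
    """Period-over-period fractional changes."""
    return [(b - a) / a for a, b in zip(series, series[1:])]


def pass_through(index, price):
    """Return (up, down): the vendor's price change as a share of the index
    change, summed separately over periods where the index rose and fell.
    A side with no index movement is returned as None."""
    up_idx = up_px = down_idx = down_px = 0.0
    for d_idx, d_px in zip(pct_changes(index), pct_changes(price)):
        if d_idx > 0:
            up_idx += d_idx
            up_px += d_px
        elif d_idx < 0:
            down_idx += d_idx
            down_px += d_px
    up = up_px / up_idx if up_idx else None
    down = down_px / down_idx if down_idx else None
    return up, down


# Hypothetical commodity index and vendor price over five periods.
index = [100.0, 110.0, 100.0, 110.0, 100.0]
# This vendor follows the index up in full and gives back half on the way down.
price = [50.0, 55.0, 52.5, 57.75, 55.125]

up, down = pass_through(index, price)
```

Here the upward pass-through is 1.0 and the downward pass-through is 0.5. The asymmetry between the two numbers is the thing to look at, more than either number alone.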

Signal 4 · response time on RFQs

Score how long the vendor takes to respond to a quote request, not just whether they respond

A vendor who responds in 24 hours with a tight quote is worth more than a vendor who responds in 6 days with the same number. The 24-hour vendor lets you close the buyer's decision in the same week the requirement landed. The 6-day vendor stretches your sourcing calendar. Track median and p95 response time. The p95 matters because the outlier responses are the ones that stretch a sourcing decision into the next month.
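Median and p95 are cheap to compute once response times are logged. The sketch below uses a nearest-rank p95, which is one of several percentile conventions, and a made-up list of response times in hours.

```python
from statistics import median


def p95(values):
    """Nearest-rank 95th percentile: the smallest observed value with at
    least 95% of observations at or below it."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


# Hypothetical RFQ response times in hours for one vendor (20 requests).
hours = [20, 22, 24, 24, 26, 30, 18, 25, 23, 21,
         27, 24, 19, 22, 26, 120, 24, 23, 25, 144]

typical = median(hours)
tail = p95(hours)
```

The median here is 24 hours and the p95 is 120 hours. A scorecard that reported only the median would describe this vendor as a 24-hour responder and miss the two requests that took most of a week.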

Signal 5 · NCR resolution time

Score how long it takes the vendor to close out an open non-conformance, not how often one is opened

Defect rate is a noisy signal because category and process variation swamp it. NCR resolution time is a much cleaner signal of vendor responsiveness. A vendor with a 1.4% defect rate who closes NCRs in 8 days is better than a vendor with a 0.8% defect rate who closes NCRs in 6 weeks. The cost of the defect is in the resolution time, not in the opening rate.
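A minimal sketch of the resolution-time signal, assuming each NCR is an (opened, closed) pair of dates. Ageing still-open NCRs to the scoring date is an assumption made here so that an unresolved NCR cannot flatter the vendor. The function name and data are hypothetical.

```python
from datetime import date
from statistics import median


def median_days_to_close(ncrs, as_of):
    """Median NCR age in days. Open NCRs (closed is None) are aged to `as_of`."""
    ages = [((closed or as_of) - opened).days for opened, closed in ncrs]
    return median(ages)


# Hypothetical NCR ledger for one vendor.
ncrs = [
    (date(2025, 4, 1), date(2025, 4, 8)),    # closed in 7 days
    (date(2025, 4, 10), date(2025, 4, 19)),  # closed in 9 days
    (date(2025, 5, 1), None),                # still open on the scoring date
]

days = median_days_to_close(ncrs, as_of=date(2025, 5, 31))
```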

And the three that look good but are noise

Noise 1 · the soft relationship field

See above

The relationship field is captured as overrides with structured reasons, not as a number out of 10. Score on things that are measurable; capture the rest as override reasons.

Noise 2 · annual spend with the vendor

A large spend does not mean a good vendor

Annual spend with the vendor shows up on most scorecards as "strategic spend" or "volume tier." It is noise. The vendor you spend the most with is the one you have the most exposure to, not necessarily the one you should keep buying from. Track it as a context field on the vendor profile, not as a scored signal.

Noise 3 · the vendor's self-reported quality cert tier

Cert tiers are a hurdle, not a signal

ISO 9001, AS9100, IATF 16949, whatever the cert is in your category. The cert is a hurdle the vendor either has or does not. Once they have it, the cert says nothing discriminating between vendors. Use it as a gate, not as a scored signal. Scoring it inflates composites for the vendors who happen to have certified processes for reasons unrelated to your category.

The four specific changes Polymr makes

When a plant's vendor scoring moves to an operations layer, four concrete things change. None of them require replacing the ERP or the QA system.

1. Signals are pulled continuously from the source systems, not snapshotted quarterly
On-time, quantity accuracy, NCR rate, payment terms, response time, and price stability all refresh continuously against the ERP, QA, finance, and inbox streams. The buyer reads a score that is current to the hour, not current to the last quarterly review.

2. Weights are first-class config with an effective-date history
The scoring weights are stored as a versioned config. Changes require a recorded reason and an effective date. The composite score for a vendor on a given date is reproducible because the weights for that date are recoverable from the version history.

3. Every score field is click-through to the underlying ledger
On-time delivery clicks through to the list of receipts that produced it. Defect rate clicks through to the list of NCRs. Price stability clicks through to the commodity baseline and the vendor's quoted prices over time. The buyer and the reviewer can both verify the score by walking the inputs.

4. Overrides are recorded with a structured reason and persist alongside the score
When the buyer awards against the scorecard recommendation, the override is captured with a structured reason from a closed list and an optional free-text note. The score is not modified. Six months of overrides become a queryable corpus that surfaces patterns the weighted score is not catching.

[Diagram: the signal aggregation pipeline · legend: baseline, override]

Sources: ERP receipts (PO + lateness) · QA inspections (NCR ledger) · Finance (payment + disputes) · RFQ inbox (response times) · Commodity index (baseline curve)
Scoring layer: weighted aggregator
Outputs: Vendor card (click-through UI) · Override record (structured reason) · Audit trail (reproducible score)

Figures illustrative · drawn from observed pattern, not a named deployment

The signal aggregation pipeline. ERP receipts, QA inspections, finance ledger, RFQ response times, and commodity baselines all feed the scoring layer continuously. The layer publishes per-vendor composites with click-through to inputs. The buyer's overrides are written back to the audit record, not the score.

What this looks like in week one

The first deployment is narrow. Pick one category that has had a sourcing decision in the past month that the buyer is still defending. Wire the ERP receipts, QA NCRs, and the buyer's inbox to the scoring layer. Compute the composite for the vendors in that category. Walk the composite with the buyer side-by-side with the decision they actually made. If the composite disagrees with the decision, that is the interesting case. Either the composite is missing a signal (drives a new scored field) or the buyer is making a judgment that should have been an override.

Within two weeks, the buyer is reading scorecards before sourcing decisions instead of reconstructing them after the fact for procurement reviews. The reviewer is walking the composite, click-through, and override history in a single session instead of asking for a binder pull.

The shape of the bet

Vendor scoring is not the place to be clever about algorithms. It is the place to be precise about which signals are scored, what the weights are, where the data came from, and where the buyer overrode the score. Get those four things right and the scorecard survives review.

The objections we hear most often

Three objections come up consistently. None of them are wrong. All of them have specific answers.

The first objection is the weights argument. "Different categories need different weights. One scorecard does not fit all." This is true. The layer supports weight configs per category. Coil weights look different from fastener weights, which look different from machined-housing weights. The disclosure property still holds per category. Each category has its own weights, each weights config has its own history, each composite is reproducible.

The second objection is the signal-availability argument. "Some of these signals are not available in our QA system or our finance system." That is also true. The layer can run with whichever signals are available and declare the rest as not-scored. The composite is the weighted average of the signals that did report. The buyer and the reviewer can see which signals are missing. A signal that is missing for a vendor is more useful as an explicit gap than as a zero or a default value.
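The weighted average over the signals that did report can be sketched as follows. It assumes signal scores are on a 0-1 scale and that a missing signal is represented as None or is simply absent. The function name is hypothetical, and the weights reuse the illustrative config from the comparison figure.

```python
def composite(weights, signals):
    """Weighted average over the signals that reported.

    Returns (score, missing). Weights are renormalised over reported signals,
    so a missing signal is neither a zero nor a default value. If nothing
    reported, the score is None.
    """
    reported = {
        name: value
        for name, value in signals.items()
        if value is not None and name in weights
    }
    missing = sorted(name for name in weights if name not in reported)
    total_weight = sum(weights[name] for name in reported)
    if total_weight == 0:
        return None, missing
    score = sum(weights[name] * reported[name] for name in reported) / total_weight
    return score, missing


weights = {"on_time": 0.3, "qa": 0.3, "price": 0.4}

# Price stability did not report for this (hypothetical) vendor.
score, missing = composite(weights, {"on_time": 0.88, "qa": 0.97, "price": None})
```

The function returns the list of missing signals alongside the score so the gap can be shown on the vendor card rather than hidden inside the number.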

The third objection is the override-fatigue argument. "If buyers have to fill in a reason for every override, they will pick a default reason and stop thinking." This is the failure mode to watch for. The mitigation is twofold. Keep the reason list short (five categories, not twenty). Periodically review the overrides for patterns that should become scored signals (if 60% of overrides cite price stability, price stability should be a scored signal). The override is a feedback loop on the score, not a permanent escape hatch.
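The periodic review can be as simple as counting reasons. The sketch below assumes override reasons are available as a flat list of strings from the closed list. The function name is hypothetical, the 0.6 threshold mirrors the 60% example above, and the counts are invented.

```python
from collections import Counter


def signals_to_promote(override_reasons, threshold=0.6):
    """Return reasons whose share of all overrides meets the threshold,
    most frequent first. These are candidates for becoming scored signals."""
    counts = Counter(override_reasons)
    total = sum(counts.values())
    if total == 0:
        return []
    return [reason for reason, n in counts.most_common() if n / total >= threshold]


# Hypothetical six months of override reasons for one category.
reasons = (
    ["price stability"] * 7
    + ["gauge availability"] * 2
    + ["packaging compatibility"]
)

candidates = signals_to_promote(reasons)
```

With seven of ten overrides citing price stability, that reason crosses the threshold and is flagged. The same count also exposes the fatigue case: a default reason picked without thought tends to show up as one category dominating every buyer's overrides.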

Footnotes

[1] The soft field is the place a procurement reviewer will most reliably find the scorecard is being used as a justification for predetermined awards. A field scored 1-10 with no rubric for what each number means is a rubber stamp. Reviewers know to look for it. The buyer who maintains the scorecard usually does not know that the reviewer knows to look for it.

