Most vendor scorecards do not survive scrutiny
Ask a mid-market manufacturer for their vendor scoring methodology. You will get one of three answers. The most common is a senior buyer who can list, from memory, the fifteen vendors they actively manage and rank them in their head, with reasons. The second most common is a spreadsheet maintained by procurement that has not been recomputed since the buyer rotated last year. The third is a tier-one ERP report that scores vendors on three fields the ERP captures natively (on-time delivery, quantity accuracy, defect rate) and ignores everything else.
None of the three survive a serious procurement review. The memory case fails because it cannot be audited. The spreadsheet case fails because the inputs are stale. The ERP-report case fails because the scoring is on the wrong fields. None of them encode the actual reasons the senior buyer prefers vendor V-218 over vendor V-244 for coil work in the third quarter, which are roughly: V-218 holds price better when the resin index moves, V-244 has a faster turnaround on the 1.2mm gauge specifically, and V-218's packaging is friendlier to the cell's incoming inspection process.
A scorecard that holds up to a procurement review has to do three things the existing approaches do not. Disclose its weights. Source-link every signal it scores on. And explicitly carry the overrides where the buyer's judgment differs from the score, with reasons.
- → 24-month PO + receipt history · continuous · on-time, qty accuracy, lead time variance
- → Incoming inspection ledger · continuous · defect rate, NCR rate, rework $
- → Payment + dispute history · continuous · invoice disputes, credit memos
- → Composite score per vendor · on-demand · weights disclosed, source-linked
- ← Override with reason · as-needed · tribal knowledge captured
Figures illustrative · drawn from observed pattern, not a named deployment
The scoring pipeline. Signals fan in continuously from ERP, QA, and finance. The composite score is computed per vendor on demand from those signals, with the weights disclosed. The buyer can override with a recorded reason. The override stays in the audit record alongside the score.
Sampled across two mid-market metal-fab plants over six months of sourcing decisions. The first two stats are the cost of bad scoring. The third is the disclosure gap when a procurement review lands.
The three structural failure modes of a vendor scorecard
Before getting to what works, name the failure modes. A bad scorecard fails in one of three recognisable ways, every time.
The score is a number with no audit trail back to inputs
The spreadsheet has a column called Composite and the formula bar shows =0.3*B2 + 0.25*C2 + 0.2*D2 + .... The weights are stored in the formula. They are not documented anywhere else. The buyer who built the spreadsheet knows that 0.3 was chosen because price stability was a board-level concern in the year the spreadsheet was created. The current buyer does not know that. The current buyer reads a composite of 6.4 and acts on it.
When the procurement reviewer asks why vendor V-244 scored lower than V-218, the answer requires walking the weights back from the formula. The walk takes hours, surfaces the fact that the weights were last touched two buyers ago, and leaves the reviewer with no confidence in the composite.
The inputs were last recomputed quarterly, by hand
The on-time delivery field on the scorecard was last updated when procurement ran the quarterly review. The current quarter is half over. V-244 had two late deliveries last month that the spreadsheet does not reflect. The buyer is working from a snapshot of vendor performance that is materially older than the decision they are about to make.
The structural reason this happens is that the signal aggregation is a human task. The ERP has the receipt history. The QA system has the inspection ledger. Finance has the payment history. Joining them into a scorecard requires three downloads, an Excel pivot, and a manual recomputation. Nobody runs that on a Tuesday because of one award decision.
The scorecard has a column called Relationship and nobody knows what is in it
The most damaging failure mode is the soft field. Most scorecards include a column called Relationship or Responsiveness or Strategic fit, scored 1-10 by the buyer. Nobody knows what 7 means. Nobody knows what would have to change for V-244 to move from 6 to 7. The soft field is where the tribal knowledge lives, and the soft field is unauditable.[1]
The standard procurement-review move is to delete the soft field. That makes the scorecard auditable and worse, because the tribal knowledge it was capturing is now captured nowhere. The buyer goes back to overriding from memory and the scorecard becomes ceremonial.
I do not need the scorecard to be right. I need to be able to explain why I picked the vendor I picked. The scorecard that prevents that conversation is worse than no scorecard.
What a scorecard that survives review looks like
The shape of a scorecard that survives a procurement review is not complicated. It has three structural properties. Disclose the weights, source-link the signals, and carry the overrides with reasons. Everything else is the specific metrics for the specific category. The structure is the same.
Every weight is visible and changeable, with the change recorded
The weights are not in a formula. They are in a config that the buyer can read and the procurement reviewer can inspect. When the weights change, the change is recorded with a reason and an effective date. The score for a vendor on a given date is reproducible because the weights for that date are recoverable.
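A minimal sketch of weights held as versioned config, assuming a simple in-memory history. The `WeightVersion` record, the `weights_on` lookup, the dates, and the first set of weights are hypothetical; only the 0.3 / 0.3 / 0.4 split mirrors the illustrative config shown later in this article.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class WeightVersion:
    effective: date   # first day this version applies
    weights: dict     # signal name -> weight
    reason: str       # recorded reason for the change


# Hypothetical history for one category. The second entry uses the
# illustrative 0.3 / 0.3 / 0.4 split; the first entry is invented.
HISTORY = [
    WeightVersion(date(2025, 1, 1), {"on_time": 0.4, "qa": 0.4, "price": 0.2},
                  "initial config"),
    WeightVersion(date(2025, 7, 1), {"on_time": 0.3, "qa": 0.3, "price": 0.4},
                  "resin index volatility"),
]


def weights_on(history, as_of):
    """Return the version in force on `as_of` (latest effective date not after it)."""
    in_force = [v for v in history if v.effective <= as_of]
    if not in_force:
        raise LookupError(f"no weights in force on {as_of}")
    return max(in_force, key=lambda v: v.effective)


def composite(signals, version):
    """Weighted sum of signals already normalised to a 0-1 scale."""
    return sum(w * signals[name] for name, w in version.weights.items())


signals = {"on_time": 0.88, "qa": 0.97, "price": 0.90}
score_march = composite(signals, weights_on(HISTORY, date(2025, 3, 15)))
score_august = composite(signals, weights_on(HISTORY, date(2025, 8, 15)))
```

The same signals produce a different composite in March and in August, and both are reproducible because the lookup is keyed on the date rather than on whatever the weights happen to be today.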
This sounds obvious until you go look at an in-production scorecard. The weights are almost always either hard-coded in a sheet formula or buried in a stored procedure nobody remembers writing. Surfacing them as a first-class config is the single highest-leverage change in the whole scorecard design.
Every score has a click-through to the underlying ledger
V-244's on-time delivery score of 88% is not a standalone number. It is a click-through to the list of receipts that produced it: the 47 receipts in the trailing 90 days, the 6 that were late, the calendar days of lateness per receipt. The reviewer can verify the score by walking the underlying receipts. The buyer can dispute the score by flagging a specific receipt as out of scope (for example, a receipt that was late because the buyer rescheduled).
The source-link is what makes the score auditable. It is also what makes the score correctable. Without it, a wrong score is permanent. With it, the next time someone notices the score is wrong, they can trace the bad input and fix it.
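One way to picture a source-linked score, as a sketch under assumptions: the score is returned together with the receipt IDs that produced it, and a receipt the buyer has flagged as out of scope is excluded but still listed. The `Receipt` fields and the sample receipt numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Receipt:
    receipt_id: str
    days_late: int               # 0 means on time
    excluded_reason: str = ""    # set when the buyer flags the receipt as out of scope


def on_time_score(receipts):
    """Return the score plus the receipt IDs behind it, so it can be walked back."""
    in_scope = [r for r in receipts if not r.excluded_reason]
    late = [r for r in in_scope if r.days_late > 0]
    score = (len(in_scope) - len(late)) / len(in_scope) if in_scope else None
    return {
        "score": score,
        "in_scope_ids": [r.receipt_id for r in in_scope],
        "late_ids": [r.receipt_id for r in late],
        "excluded_ids": [r.receipt_id for r in receipts if r.excluded_reason],
    }


# Hypothetical sample: eight on-time receipts, one late, one late but flagged.
receipts = [Receipt(f"R-{i:03d}", 0) for i in range(1, 9)]
receipts.append(Receipt("R-009", 3))
receipts.append(Receipt("R-010", 5, "buyer rescheduled"))
result = on_time_score(receipts)
```

Because the excluded receipt stays in the result with its reason, a reviewer can see both the score and the dispute that shaped it.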
Buyer overrides are first-class records, not deleted scores
When the buyer awards V-218 over the higher-scoring V-244, the override is recorded as a structured object. The fields are: which vendor was selected, which vendor the scorecard recommended, the reason from a closed list of categories (price stability, gauge availability, packaging compatibility, payment terms, relationship), and a free-text note. The score itself is not modified. The override sits alongside the score in the audit record.
The override pattern is what captures the tribal knowledge. Six months of overrides accumulate into a corpus that says "for coil work in Q3, V-218 is preferred for resin price stability eighty percent of the time." That corpus is now visible to the next buyer. It can also feed back into the scorecard design (perhaps price stability should be its own scored signal rather than appearing only in overrides).
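The override record described above could be sketched as follows. The reason categories are the closed list from the text; the class names, the `category` field, and the sample data are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Reason(Enum):
    PRICE_STABILITY = "price stability"
    GAUGE_AVAILABILITY = "gauge availability"
    PACKAGING_COMPATIBILITY = "packaging compatibility"
    PAYMENT_TERMS = "payment terms"
    RELATIONSHIP = "relationship"


@dataclass(frozen=True)
class Override:
    category: str      # sourcing category, e.g. "coil" (hypothetical field)
    selected: str      # vendor the buyer awarded
    recommended: str   # vendor the scorecard recommended
    reason: Reason     # closed-list reason
    note: str = ""     # optional free text


def reason_shares(overrides, category):
    """Share of overrides per reason within one category."""
    rows = [o for o in overrides if o.category == category]
    counts = Counter(o.reason for o in rows)
    return {reason: n / len(rows) for reason, n in counts.items()}


# Hypothetical corpus: five coil overrides, four citing price stability.
overrides = [Override("coil", "V-218", "V-244", Reason.PRICE_STABILITY)] * 4
overrides.append(Override("coil", "V-244", "V-218", Reason.GAUGE_AVAILABILITY,
                          "1.2mm gauge turnaround"))
shares = reason_shares(overrides, "coil")
```

The score never appears in the record, so logging an override cannot alter it. The record only states what was recommended, what was chosen, and why.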
What the report knows
- V-244 on-time: 88%
- V-244 qty accuracy: 97%
- V-244 defect rate: 1.2%
- Weights: undocumented
- Soft field: relationship: 7
- Last update: Q1 2026
What an ops layer maintains
- On-time: 88% · 47 receipts · 6 late
- Qty accuracy: 97% · 47 receipts · 2 short
- Defect rate: 1.2% · 3 NCRs · click-through
- Price stability: +/- 2.8% vs index
- Weights config: on-time 0.3 · qa 0.3 · price 0.4
- Last computed: continuous · 12 min ago
- Overrides: 2 in trailing 90 days, both for gauge
- Audit trail: full · per-signal click-through
The composite is the same kind of number. The difference is the disclosure, the freshness, and the override capture.
What an ERP-native vendor report contains vs. what a scorecard that survives review contains. The ERP report is correct as far as it goes. The scorecard adds disclosed weights, source-linked signals across three systems, and a structured override record that captures the buyer's judgment instead of erasing it.
Five signals worth scoring (and three that look good but are noise)
Vendor scoring is one of those problems where the right signals are unglamorous and the popular signals are misleading. A short list of what to actually score and what to drop.
Score against the vendor's acknowledged date, not the originally requested date
On-time delivery is the obvious signal and the one most often mis-scored. The score should be on-time against the vendor's acknowledged date, not against the buyer's original requested date. Vendors who push back on requested dates and then hit their acknowledged dates are reliable. Vendors who accept every requested date and miss half of them are not. The ERP report usually scores against the requested date, which rewards the wrong behaviour.
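A small sketch of the difference, assuming each PO line carries a requested date, an acknowledged date, and a receipt date. The field names and the sample dates are hypothetical.

```python
from datetime import date


def on_time_rate(lines, promise_field):
    """Share of lines received on or before the chosen promise date."""
    hits = sum(1 for ln in lines if ln["received"] <= ln[promise_field])
    return hits / len(lines)


# Hypothetical vendor that pushes back on requested dates, then hits what it acknowledged.
lines = [
    {"requested": date(2025, 3, 3), "acknowledged": date(2025, 3, 10), "received": date(2025, 3, 10)},
    {"requested": date(2025, 3, 17), "acknowledged": date(2025, 3, 21), "received": date(2025, 3, 20)},
    {"requested": date(2025, 4, 1), "acknowledged": date(2025, 4, 1), "received": date(2025, 4, 1)},
    {"requested": date(2025, 4, 14), "acknowledged": date(2025, 4, 18), "received": date(2025, 4, 17)},
]

vs_requested = on_time_rate(lines, "requested")
vs_acknowledged = on_time_rate(lines, "acknowledged")
```

On this sample the same receipts score 25% against the requested date and 100% against the acknowledged date, which is the gap between the two scoring conventions.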
Score short and over shipments at the line level, not the order level
A PO with five lines where one is short and one is over is not 100% accurate just because the totals net out. Score per line. Track the absolute deviation, not the signed one. Vendors who routinely send over-shipments to make up for past short-shipments are gaming the order-level score and costing you incoming-inspection time.
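To make the netting problem concrete, here is a sketch that scores the same hypothetical five-line PO both ways. Both accuracy formulas are assumptions chosen for illustration (one minus deviation over ordered quantity), not a standard definition.

```python
def order_level_accuracy(lines):
    """Compares totals only, so a short line and an over line cancel out."""
    ordered = sum(q for q, _ in lines)
    received = sum(r for _, r in lines)
    return 1 - abs(received - ordered) / ordered


def line_level_accuracy(lines):
    """Sums absolute per-line deviation, so shorts and overs both count."""
    ordered = sum(q for q, _ in lines)
    deviation = sum(abs(r - q) for q, r in lines)
    return 1 - deviation / ordered


# (ordered, received) per line: one line short by 10, one line over by 10.
po_lines = [(100, 100), (100, 90), (100, 110), (50, 50), (50, 50)]

order_score = order_level_accuracy(po_lines)
line_score = line_level_accuracy(po_lines)
```

The order-level figure comes out perfect because the totals match, while the line-level figure registers the 20 units that incoming inspection actually had to deal with.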
Score how the vendor's price moves relative to a published commodity baseline
This is the signal most scorecards miss. Vendors who hold price within the band of the underlying commodity index are valuable in a way that does not show up in any standard scorecard field. Vendors who pass through every micro-move in the index are passing the index volatility to you. Vendors who pass the index through on the way up but lag it on the way down are net-extractive. None of this is visible without scoring against a baseline.
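One possible way to score price behaviour against a baseline is to measure pass-through separately for index rises and index falls. This is a sketch under assumptions: the ratio definition, the function name, and both price series are hypothetical, and a real index would need date alignment that is omitted here.

```python
def pass_through(index, price):
    """Ratio of cumulative vendor price change to cumulative index change,
    computed separately over periods where the index rose and where it fell."""
    up_idx = up_px = down_idx = down_px = 0.0
    for i in range(1, len(index)):
        d_idx = index[i] / index[i - 1] - 1
        d_px = price[i] / price[i - 1] - 1
        if d_idx > 0:
            up_idx += d_idx
            up_px += d_px
        elif d_idx < 0:
            down_idx += d_idx
            down_px += d_px
    return {
        "up": up_px / up_idx if up_idx else None,
        "down": down_px / down_idx if down_idx else None,
    }


# Hypothetical series: the index alternates +10% / -10%; the vendor's price
# follows rises in full (+10%) but falls only by half (-5%).
index = [100.0, 110.0, 99.0, 108.9, 98.01]
price = [50.0, 55.0, 52.25, 57.475, 54.60125]

ratios = pass_through(index, price)
```

A ratio near 1.0 in both directions means the vendor tracks the index. A ratio near 0 in both means the vendor absorbs the moves. The gap between the `up` and `down` figures is the asymmetry that a single volatility number would hide.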
Score how long the vendor takes to respond to a quote request, not just whether they respond
A vendor who responds in 24 hours with a tight quote is worth more than a vendor who responds in 6 days with the same number. The 24-hour vendor lets you close the buyer's decision in the same week the requirement landed. The 6-day vendor stretches your sourcing calendar. Track median and p95 response time. The p95 matters because the outlier responses are the ones that stretch a sourcing decision into the next month.
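A short sketch of the two statistics, assuming response times are already available in hours per RFQ. The sample values are hypothetical, and the p95 uses the nearest-rank method, which is one of several common percentile definitions.

```python
import math
from statistics import median


def p95(values):
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n), 1-based."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


# Hypothetical hours-to-quote for ten RFQs sent to one vendor.
response_hours = [20, 22, 24, 24, 26, 28, 30, 36, 48, 144]

typical = median(response_hours)
tail = p95(response_hours)
```

Here the median stays close to a day while the p95 lands on the six-day outlier, which is the response that would have pushed a sourcing decision out.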
Score how long it takes the vendor to close out an open non-conformance, not how often one is opened
Defect rate is a noisy signal because category and process variation swamp it. NCR resolution time is a much cleaner signal of vendor responsiveness. A vendor with a 1.4% defect rate who closes NCRs in 8 days is better than a vendor with a 0.8% defect rate who closes NCRs in 6 weeks. The cost of the defect is in the resolution time, not in the opening rate.
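A minimal sketch of an NCR resolution-time signal, assuming each NCR has an opened date and an optional closed date. Ageing still-open NCRs to the scoring date is a design assumption here, made so that unresolved NCRs count against the vendor. The dates are hypothetical.

```python
from datetime import date
from statistics import median


def ncr_resolution_days(ncrs, as_of):
    """Median NCR age in days. Open NCRs (closed is None) are aged to `as_of`."""
    ages = [((closed or as_of) - opened).days for opened, closed in ncrs]
    return median(ages) if ages else None


# Hypothetical ledger: two closed NCRs and one still open on the scoring date.
ncrs = [
    (date(2025, 1, 6), date(2025, 1, 14)),
    (date(2025, 2, 3), date(2025, 2, 10)),
    (date(2025, 3, 3), None),
]

resolution = ncr_resolution_days(ncrs, as_of=date(2025, 3, 12))
```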
And the three that look good but are noise
The relationship score belongs in overrides, not in the composite
The relationship field is captured as overrides with structured reasons, not as a number out of 10. Score on things that are measurable; capture the rest as override reasons.
A large spend does not mean a good vendor
Annual spend with the vendor shows up on most scorecards as "strategic spend" or "volume tier." It is noise. The vendor you spend the most with is the one you have the most exposure to, not necessarily the one you should keep buying from. Track it as a context field on the vendor profile, not as a scored signal.
Cert tiers are a hurdle, not a signal
ISO 9001, AS9100, IATF 16949, whatever the cert is in your category. The cert is a hurdle the vendor either has or does not. Once they have it, the cert says nothing discriminating between vendors. Use it as a gate, not as a scored signal. Scoring it inflates composites for the vendors who happen to have certified processes for reasons unrelated to your category.
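Treating a cert as a gate rather than a signal can be as simple as filtering before any scoring happens. The sketch below assumes cert names are compared as plain strings. The cert holdings shown for each vendor, and vendor V-301 itself, are hypothetical.

```python
def passes_gate(vendor_certs, required_certs):
    """True when the vendor holds every required cert. Extra certs earn nothing."""
    return set(required_certs) <= set(vendor_certs)


# Hypothetical cert holdings per vendor.
vendors = {
    "V-218": {"ISO 9001"},
    "V-244": {"ISO 9001", "AS9100"},
    "V-301": set(),
}

required = {"ISO 9001"}
eligible = sorted(v for v, certs in vendors.items() if passes_gate(certs, required))
```

V-244's additional cert does not move it ahead of V-218. Both simply pass the gate and are then scored on the same signals.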
The four specific changes Polymr makes
When a plant's vendor scoring moves to an operations layer, four concrete things change. None of them require replacing the ERP or the QA system.
- 1. Signals are pulled continuously from the source systems, not snapshotted quarterly. On-time, quantity accuracy, NCR rate, payment terms, response time, and price stability all refresh continuously against the ERP, QA, finance, and inbox streams. The buyer reads a score that is current to the hour, not current to the last quarterly review.
- 2. Weights are first-class config with an effective-date history. The scoring weights are stored as a versioned config. Changes require a recorded reason and an effective date. The composite score for a vendor on a given date is reproducible because the weights for that date are recoverable from the version history.
- 3. Every score field is click-through to the underlying ledger. On-time delivery clicks through to the list of receipts that produced it. Defect rate clicks through to the list of NCRs. Price stability clicks through to the commodity baseline and the vendor's quoted prices over time. The buyer and the reviewer can both verify the score by walking the inputs.
- 4. Overrides are recorded with a structured reason and persist alongside the score. When the buyer awards against the scorecard recommendation, the override is captured with a structured reason from a closed list and an optional free-text note. The score is not modified. Six months of overrides become a queryable corpus that surfaces patterns the weighted score is not catching.
- ERP receipts · PO + lateness
- QA inspections · NCR ledger
- Finance · payment + disputes
- RFQ inbox · response times
- Commodity index · baseline curve
- Scoring layer · weighted aggregator
- Vendor card · click-through UI
- Override record · structured reason
- Audit trail · reproducible score
The signal aggregation pipeline. ERP receipts, QA inspections, finance ledger, RFQ response times, and commodity baselines all feed the scoring layer continuously. The layer publishes per-vendor composites with click-through to inputs. The buyer's overrides are written back to the audit record, not the score.
What this looks like in week one
The first deployment is narrow. Pick one category that has had a sourcing decision in the past month that the buyer is still defending. Wire the ERP receipts, QA NCRs, and the buyer's inbox to the scoring layer. Compute the composite for the vendors in that category. Walk the composite with the buyer side-by-side with the decision they actually made. If the composite disagrees with the decision, that is the interesting case. Either the composite is missing a signal (drives a new scored field) or the buyer is making a judgment that should have been an override.
Within two weeks, the buyer is reading scorecards before sourcing decisions instead of reconstructing them after the fact for procurement reviews. The reviewer is walking the composite, click-through, and override history in a single session instead of asking for a binder pull.
Vendor scoring is not the place to be clever about algorithms. It is the place to be precise about which signals are scored, what the weights are, where the data came from, and where the buyer overrode the score. Get those four things right and the scorecard survives review.
The objections we hear most often
Three objections come up consistently. None of them are wrong. All of them have specific answers.
The first objection is the weights argument. "Different categories need different weights. One scorecard does not fit all." This is true. The layer supports weight configs per category. Coil weights look different from fastener weights, which in turn look different from machined-housing weights. The disclosure property still holds per category. Each category has its own weights, each weights config has its own history, each composite is reproducible.
The second objection is the signal-availability argument. "Some of these signals are not available in our QA system or our finance system." That is also true. The layer can run with whichever signals are available and declare the rest as not-scored. The composite is the weighted average of the signals that did report. The buyer and the reviewer can see which signals are missing. A signal that is missing for a vendor is more useful as an explicit gap than as a zero or a default value.
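A sketch of a composite over partially available signals, assuming missing signals are represented as `None` and the weights of the reporting signals are renormalised to sum to one. The renormalisation step and the function name are assumptions. The weights are the illustrative 0.3 / 0.3 / 0.4 split used elsewhere in this article.

```python
def composite_with_gaps(signals, weights):
    """Weighted average over the signals that reported, plus the explicit gaps."""
    reported = {k: v for k, v in signals.items() if k in weights and v is not None}
    missing = sorted(k for k in weights if k not in reported)
    total_weight = sum(weights[k] for k in reported)
    if total_weight == 0:
        return None, missing   # nothing reported, so there is no score to show
    score = sum(weights[k] * reported[k] for k in reported) / total_weight
    return score, missing


weights = {"on_time": 0.3, "qa": 0.3, "price": 0.4}
signals = {"on_time": 0.88, "qa": 0.97, "price": None}   # no price baseline yet

score, missing = composite_with_gaps(signals, weights)
```

Returning the list of missing signals next to the score keeps the gap visible. Substituting zero for the missing price signal would instead have dragged the composite down for a reason unrelated to the vendor.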
The third objection is the override-fatigue argument. "If buyers have to fill in a reason for every override, they will pick a default reason and stop thinking." This is the failure mode to watch for. The mitigation is twofold. Keep the reason list short (five categories, not twenty). Periodically review the overrides for patterns that should become scored signals (if 60% of overrides cite price stability, price stability should be a scored signal). The override is a feedback loop on the score, not a permanent escape hatch.
Footnotes
- [1] The soft field is the place a procurement reviewer will most reliably find the scorecard is being used as a justification for predetermined awards. A field scored 1-10 with no rubric for what each number means is a rubber stamp. Reviewers know to look for it. The buyer who maintains the scorecard usually does not know that the reviewer knows to look for it.
