An overlay is a smaller bet, not a smaller architecture

An MRP overlay is the pragmatic answer to the question "our MRP is wrong about planning, what do we do?" Rather than rip and replace the MRP, you wire a layer on top that reads the MRP's data, runs a better planning loop, and either surfaces proposed actions for an operator or writes back to the MRP directly. The bet is that the existing MRP's transactional surface is fine and that only the planning brain needs replacing.

The bet is correct often enough that overlays are a recognised architecture in mid-market manufacturing. We have shipped three of them. We have watched four others get shipped by other teams. Three of those seven failed within 18 months, and one is still mid-flight. Side by side, the failures cluster into three recognisable shapes. None of them are tooling problems. All of them are structural choices the overlay made early and could not back out of later.

This piece is about the three failure modes, the structural cause of each, and the specific design moves that prevent them. Read it before you commit to an overlay architecture, not after.

[Diagram: MRP transactional surface → overlay read loop → planning logic → operator gate (human approval) → write surface]

Figures illustrative · drawn from observed pattern, not a named deployment

An MRP overlay in its simplest shape. The MRP keeps the transactional surface. The overlay reads from the MRP, runs a planning loop, and proposes actions to the operator. The operator approves and the overlay writes back. The three failure modes are at the three places the architecture has freedom: the read loop, the write surface, and the operator interface.

The seven overlays we have observed

Survival rate: 43% of overlays still in production after 18 months (three of seven).

Time to failure: 11 months, the median from go-live to operator abandonment when an overlay fails. Not a quick collapse.

Failure attribution: 86% of post-mortems blame the planning algorithm rather than the structural shape. Diagnosed wrong.

Sample of MRP overlays across discrete and process manufacturing, sized 200-2000 person plants. Three shipped successfully, three failed within 18 months, one is mid-flight. Failure modes cluster into three shapes regardless of the underlying MRP.

Failure mode 1: planning loop divergence

The first failure mode is the most common and the most recoverable. The overlay's planning loop and the MRP's planning loop both run. They produce different answers. The operator does not know which one to trust. Over time, the operator picks the one whose number is easier to defend in the production meeting, which is usually the MRP's number, because the MRP is the system that finance asks about.

The mechanic is concrete. The MRP runs its regen overnight and produces a planning schedule for the next 90 days. The overlay subscribes to demand events and supply events, runs a continuous planning loop, and produces a different schedule. The two schedules look at the same demand and the same supply and disagree on the lot sizes, the release dates, and the safety stock posture.

Neither schedule is necessarily wrong. They have different assumptions baked in. The MRP assumes the lot-sizing rule configured in the master data. The overlay assumes a continuous re-optimisation against a cost function that weighs setup time against carrying cost. When the two disagree by 18% on next-week releases, the operator has no principled way to pick.^[1]
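To make the disagreement measurable rather than anecdotal, a minimal sketch follows. It assumes both schedules can be flattened to (item, week, quantity) releases; the `Release` type, the function names, and the choice to normalise by the MRP's released quantity are all illustrative assumptions, not how the audited deployment computed its figure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """One planned release. Hypothetical shape, not an MRP schema."""
    item: str
    week: int
    qty: float


def weekly_totals(schedule, week):
    """Sum released quantity per item for a single week."""
    totals = defaultdict(float)
    for release in schedule:
        if release.week == week:
            totals[release.item] += release.qty
    return totals


def release_disagreement(mrp, overlay, week):
    """Absolute gap between the two planners, relative to the MRP's total.

    Assumption: the MRP's released quantity is the denominator, because
    the MRP's number is the one the operator has to defend.
    """
    mrp_qty = weekly_totals(mrp, week)
    ovl_qty = weekly_totals(overlay, week)
    items = set(mrp_qty) | set(ovl_qty)
    gap = sum(abs(mrp_qty.get(i, 0.0) - ovl_qty.get(i, 0.0)) for i in items)
    base = sum(mrp_qty.values())
    if base == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / base
```

Tracking this number week over week shows whether the two planners are converging or drifting apart, which is the question the operator cannot answer by eye.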

The structural cause

Two planning systems with no contract between them

The structural cause is that the overlay was built as a better planner and not as a controller of the MRP's planning. The two systems run in parallel and produce parallel outputs. There is no contract that says "when the overlay disagrees with the MRP, the overlay's answer overrides because of X." The disagreement is permanent and unresolved.

The fix

One planner, not two. The overlay drives the MRP's planning inputs.

The fix is to stop running two planners. The MRP's planning engine is left in place. The overlay's job is to set the planning inputs (lot-sizing rules, safety stock parameters, planning horizons, ABC class definitions) based on whatever continuous analysis the overlay is doing. The MRP regen still runs and produces the single source of truth. The overlay's improvement shows up as better planning inputs going into the regen, not as a parallel planning output.

This change of shape is uncomfortable for overlay vendors who pitched their product as a smarter planner. It is the architecture that survives. The MRP is the planner. The overlay tunes the planner's parameters continuously, which it can do better than the human planner who used to tune them quarterly.
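A sketch of what "the overlay tunes the planner's parameters" could look like in code. The formulas here are the standard textbook economic order quantity and a z-score safety stock, swapped in purely for illustration; the article does not say which rules any real overlay uses. `PlanningInputs`, `tune_inputs`, `needs_update`, and the 10% tolerance are hypothetical names and values.

```python
import math
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class PlanningInputs:
    """Parameters the MRP regen consumes. Hypothetical shape."""
    item: str
    lot_size: int
    safety_stock: int


def tune_inputs(item, weekly_demand, lead_time_weeks,
                setup_cost, carrying_cost_per_unit_week, z=1.65):
    """Derive planning inputs from recent demand (textbook formulas)."""
    demand = mean(weekly_demand)
    sigma = stdev(weekly_demand)  # sample standard deviation
    # Economic order quantity: balances setup cost against carrying cost.
    lot = math.sqrt(2 * demand * setup_cost / carrying_cost_per_unit_week)
    # Safety stock covering demand variability over the lead time.
    safety = z * sigma * math.sqrt(lead_time_weeks)
    return PlanningInputs(item, max(1, round(lot)), max(0, math.ceil(safety)))


def needs_update(current, proposed, tolerance=0.10):
    """Only push a change to the MRP when a parameter moved materially."""
    def moved(old, new):
        return abs(new - old) > tolerance * max(old, 1)
    return (moved(current.lot_size, proposed.lot_size)
            or moved(current.safety_stock, proposed.safety_stock))
```

The tolerance check matters for the design: the overlay analyses continuously, but the MRP's master data should only change when the change is large enough to justify it, so the regen's inputs stay stable between runs.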

We had two systems telling us two different things every morning. The team picked the one that was easier to explain upstairs. After six months they stopped opening the overlay.

Plant operations manager, post-overlay abandonment

Failure mode 2: write surface drift

The second failure mode is sneakier. The overlay writes back to the MRP. The writes start out matching the MRP's native write semantics. Over months, the overlay accumulates edge cases that the MRP cannot represent natively. The overlay's writes drift away from the MRP's conventions. The MRP's downstream consumers (finance, reporting, integrations to upstream and downstream systems) start producing wrong numbers because they assumed the MRP's native convention.

A specific example. The overlay introduces a concept called "tentative reservation," which represents inventory the planner has earmarked for a work order but has not yet committed to. The MRP has no native concept of a tentative reservation. The overlay encodes it as a regular inventory reservation with a custom flag. Finance's month-end report counts the reservations as committed. The reported inventory position is wrong by the tentative-reservation balance. Nobody notices for three months because the balance is small. By the time it gets large enough to notice, the overlay has introduced four more such concepts and unwinding them is a multi-quarter project.

The structural cause

The overlay invented concepts the MRP cannot represent

The structural cause is that the overlay's richer model exceeded the MRP's expressive capacity. The overlay designer had two choices when they introduced the new concept. Either constrain the concept to fit the MRP's existing types or extend the MRP. They chose a third path: encode the concept in an existing type with a custom flag. The flag is invisible to every consumer of the MRP that did not get told about it.

The fix

Concepts the MRP cannot represent live in the overlay, not in custom MRP flags

The fix is a sharper boundary. If the overlay needs a concept the MRP cannot represent, the concept lives in the overlay's own state and is never written to the MRP. The overlay's writes to the MRP are restricted to the MRP's native semantics. The overlay can decorate the MRP's state in its own UI with the richer concepts, but it does not pollute the MRP record.

For the tentative-reservation example, the overlay tracks tentative reservations in its own table. The MRP sees only the inventory that the overlay has committed. Finance's report is correct against the MRP. The overlay's UI shows the planner the tentative-reservation balance from its own state. No drift, because nothing custom landed in the MRP.
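The boundary can be sketched with two separate stores. In this minimal example the MRP and the overlay are stand-in in-memory SQLite databases, and every table and function name is hypothetical. The point it illustrates is only that a tentative reservation never touches the MRP until it is committed, and that the committed write uses the MRP's native columns.

```python
import sqlite3


def setup():
    """Create stand-ins for the MRP store and the overlay's own store."""
    mrp = sqlite3.connect(":memory:")
    mrp.execute(
        "CREATE TABLE reservation (work_order TEXT, item TEXT, qty INTEGER)")
    overlay = sqlite3.connect(":memory:")
    overlay.execute(
        "CREATE TABLE tentative_reservation "
        "(work_order TEXT, item TEXT, qty INTEGER)")
    return mrp, overlay


def earmark(overlay, work_order, item, qty):
    """Record a tentative reservation in overlay state only."""
    overlay.execute("INSERT INTO tentative_reservation VALUES (?, ?, ?)",
                    (work_order, item, qty))


def commit(mrp, overlay, work_order, item):
    """Promote a tentative reservation to a native MRP reservation.

    Note: the two stores are not updated atomically in this sketch; a
    real integration would need to handle a failure between the steps.
    """
    row = overlay.execute(
        "SELECT qty FROM tentative_reservation "
        "WHERE work_order = ? AND item = ?", (work_order, item)).fetchone()
    if row is None:
        raise KeyError((work_order, item))
    mrp.execute("INSERT INTO reservation VALUES (?, ?, ?)",
                (work_order, item, row[0]))
    overlay.execute(
        "DELETE FROM tentative_reservation "
        "WHERE work_order = ? AND item = ?", (work_order, item))


def committed_qty(mrp, item):
    """What finance's report sees: native MRP reservations only."""
    return mrp.execute(
        "SELECT COALESCE(SUM(qty), 0) FROM reservation WHERE item = ?",
        (item,)).fetchone()[0]


def tentative_qty(overlay, item):
    """What the overlay UI shows the planner from its own state."""
    return overlay.execute(
        "SELECT COALESCE(SUM(qty), 0) FROM tentative_reservation "
        "WHERE item = ?", (item,)).fetchone()[0]
```

Because `committed_qty` reads only the MRP, a month-end report built on it cannot overcount by the tentative balance, whatever the overlay is tracking on its side.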

Before · drift accumulates

What the MRP record carries

Inventory reservation: flag tentative=1
WO release: flag provisional=1
BOM line: flag overlay-substitute=1
Routing op: flag scenario=A
Finance report: overcounts by flag-balance

After · clean boundary

What lives where

MRP record: native semantics only
Tentative reservations: overlay state · own table
Provisional WOs: overlay state · own table
Overlay UI: merges MRP + overlay views
Finance report: reads MRP · correct
Integration consumers: see MRP · correct
Overlay-only consumers: use overlay API

The MRP record is left in MRP-native shape. The overlay's richer concepts live in the overlay. Drift is impossible because nothing custom ever landed in the MRP.

Write surface drift, before and after. Before: the overlay extends MRP records with custom flags that downstream consumers do not know about. After: the overlay keeps richer concepts in its own state and only writes to the MRP what the MRP can natively represent.

Failure mode 3: parallel-spreadsheet emergence

The third failure mode is the one that signals the overlay has lost the operators. The overlay was supposed to replace the planner's spreadsheet. Six months in, the planner is still maintaining the spreadsheet, in parallel, by hand. The overlay's output is one of the inputs to the spreadsheet, not the system of action.

This is the failure mode that ops leadership notices last because the spreadsheet is invisible to them. The overlay dashboard shows healthy adoption metrics. Operators are logging in, looking at the screen, and taking actions in the overlay. What the metrics do not show is that the actions in the overlay are being decided in the spreadsheet, then keyed into the overlay to make the metrics look right.

The structural cause

The overlay made the easy 80% easier and the hard 20% harder

The structural cause is almost always the same. The overlay handles the routine planning loop well. It surfaces the straightforward proposals, drives the easy approvals, keeps the obvious things on schedule. What it does not handle is the planner's actual day-to-day work, which is the hard 20%: the changeovers that involve a specialist operator's vacation, the supplier delay that landed at 5pm on a Friday, the customer order that came in at a quantity that breaks the standard lot. The hard 20% lives in the spreadsheet because the overlay never modelled it.

The overlay made the easy work even easier by automating the propose-and-approve cycle. It made the hard work harder by hiding the underlying data behind a model that does not cover the hard cases. The planner can no longer answer their own ad-hoc questions inside the overlay and falls back to the data extract and the spreadsheet.

The fix

Make the overlay's underlying data queryable, even when the surface UI does not handle the case

The fix is to never let the overlay become a black box. The overlay's richer model has to be queryable by the planner directly. Not through the proposals UI. Through a query surface that exposes the overlay's state at the row level, with filters and joins, exportable to a CSV.

The pattern that works: a flexible table view on top of the overlay's state, with the same filter, sort, and lookup idioms the planner already knows from the spreadsheet. If the planner has to answer an ad-hoc question, they answer it inside the overlay's table view. The proposals UI handles the 80%. The table view handles the 20%. The spreadsheet stops being needed because the overlay carries both.
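A minimal sketch of such a query surface, assuming the overlay's state sits in a SQL store. The tables, columns, and sample rows are invented for illustration, and the `SELECT`-prefix check is a deliberately naive read-only guard rather than a real access control.

```python
import csv
import io
import sqlite3


def build_overlay_state():
    """Stand-in for the overlay's state. Tables and rows are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE work_order (wo TEXT PRIMARY KEY, item TEXT, due_week INTEGER);
        CREATE TABLE tentative_reservation (wo TEXT, item TEXT, qty INTEGER);
        INSERT INTO work_order VALUES
            ('WO-1', 'RESIN', 12), ('WO-2', 'RESIN', 13), ('WO-3', 'CATALYST', 12);
        INSERT INTO tentative_reservation VALUES
            ('WO-1', 'RESIN', 40), ('WO-2', 'RESIN', 25), ('WO-3', 'CATALYST', 5);
    """)
    return conn


def query_to_csv(conn, sql, params=()):
    """Run a row-level query and return the result as CSV text."""
    # Naive guard: the query surface is for reading, not for editing state.
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("query surface is read-only")
    cursor = conn.execute(sql, params)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor.fetchall())
    return buffer.getvalue()


# An ad-hoc question the proposals UI does not cover:
# which resin work orders hold tentative stock, and when are they due?
RESIN_QUESTION = """
    SELECT w.wo AS wo, w.due_week AS due_week, t.qty AS qty
    FROM work_order AS w
    JOIN tentative_reservation AS t ON t.wo = w.wo
    WHERE w.item = ?
    ORDER BY w.wo
"""
```

Calling `query_to_csv(build_overlay_state(), RESIN_QUESTION, ("RESIN",))` returns a header row plus one row per matching work order, ready to paste wherever the planner would otherwise have built the extract by hand.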

Failure mode summary

Three failures, three structural causes, three fixes

1. Planning loop divergence
Symptom: two planners, two answers, operator picks the easier-to-defend one. Cause: no contract between the overlay and the MRP planning. Fix: the overlay tunes the MRP's planning inputs, the MRP remains the sole planner.

2. Write surface drift
Symptom: finance and integrations produce wrong numbers that nobody can trace. Cause: the overlay invented concepts the MRP cannot represent and encoded them in custom flags. Fix: custom concepts live in overlay state, MRP writes stay strictly native.

3. Parallel spreadsheet emergence
Symptom: the planner keeps the spreadsheet despite the overlay being live. Cause: the overlay handles the routine 80% well and hides the hard 20% behind a model. Fix: a flexible table view exposes the overlay's state to ad-hoc query, retiring the spreadsheet by absorbing the cases it covered.

Where this hurts most · process manufacturing

Plant manager, specialty chemicals (one site, batch-driven)

Situation: The site ran 14–18 batches per week across three reactors. Yield variance per batch ran 2–5% depending on raw input lot, ambient conditions, and operator routing.
What was breaking: A raw material lot that expired mid-batch broke lot genealogy in the ERP. Margin and yield were quarterly finance exercises rather than a live signal. A 3-day inbound slip from a single-source resin vendor meant reshaping the next 9 batch plans by hand.

Planning + purchasing
Delay recovery
Margin and bottleneck analysis

Outcome · 10 weeks

Weekly batch plan stability: 89% (was 62%, +27 pts)

Illustrative, reflects this specific deployment. Outcomes vary by plant, stack, and scope.

The design checklist we now use

The architectural moves that prevent the three failure modes are not exotic. They are specific choices made early. This is the short checklist we run before committing to an overlay architecture.

1. One planner is responsible for the schedule the operator acts on
Either the MRP or the overlay is the planner. Not both. If the overlay is the planner, the MRP's planning engine is turned off. If the MRP is the planner, the overlay's job is to set the planning inputs. Two planners is the failure mode.

2. The MRP record contains only what the MRP can natively represent
The overlay can extend the planner's mental model. It cannot extend the MRP's record format. Concepts the MRP cannot represent live in the overlay, exposed through the overlay's UI and API. The MRP's downstream consumers see what the MRP has always shown them.

3. The overlay exposes its full state to ad-hoc query, not just to the proposals UI
The planner can write a query against the overlay's state at the row level. They can export. They can filter and join. The proposals UI is the daily driver. The query surface is the escape hatch that prevents the parallel spreadsheet from emerging.

4. Every write to the MRP is logged with the overlay's reasoning
When the overlay writes to the MRP, the write carries a reference back to the overlay state and the reasoning that produced it. The MRP record itself is native. The audit trail is rich. If a downstream consumer queries why an MRP record changed, the answer is one click away.

5. The operator can override any overlay decision with a recorded reason
The overlay's proposals are proposals. The operator can override. The override is recorded as a structured object. Patterns in the overrides feed back into the overlay's planning logic. Overrides are not exceptions to manage; they are the primary signal that the overlay is missing something.
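Checklist items 4 and 5 both come down to recording structured objects rather than free text. A minimal sketch of the two records follows, assuming the audit trail is kept on the overlay side so the MRP record stays native. Every field name and reason code here is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MrpWrite:
    """Audit entry for one native write to the MRP (kept by the overlay)."""
    mrp_record_id: str
    field: str
    old_value: str
    new_value: str
    proposal_id: str  # reference back to the overlay state behind the write
    reasoning: str


@dataclass(frozen=True)
class Override:
    """An operator override, recorded as data rather than as a comment."""
    proposal_id: str
    operator: str
    reason_code: str  # hypothetical controlled vocabulary
    note: str = ""


def why_changed(writes, mrp_record_id):
    """Answer 'why did this MRP record change?' from the audit trail."""
    return [(w.proposal_id, w.reasoning)
            for w in writes if w.mrp_record_id == mrp_record_id]


def top_override_reasons(overrides, n=3):
    """Most frequent override reasons: the signal of what the overlay misses."""
    return Counter(o.reason_code for o in overrides).most_common(n)
```

Using a controlled `reason_code` instead of a free-text field is what makes the overrides countable, which is the precondition for feeding them back into the planning logic.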

The fourth failure mode that does not get its own section

There is a fourth failure mode that does not deserve its own section because it is fixable in the first month if anyone is paying attention. The overlay launches with no observability. The planner is making decisions inside it. Nobody outside the overlay can see what is happening. When a problem lands (a missed delivery, a stockout, a customer dispute), the ops manager cannot answer "what did the overlay propose, when did the planner approve, and what changed in the MRP because of it" without interviewing the planner.

The fix is to ship the overlay with a structured event log from day one. Every proposal, every approval, every override, every write to the MRP is an event. The event log is queryable. The post-mortem of any incident starts with a query against the event log, not with an interview with the planner. Skip this and the overlay is treated as a black box by everyone outside the planning team, and you lose budget the first time something goes wrong.
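As a rough illustration of the event log, the sketch below keeps events in memory and reconstructs the timeline for one proposal. The four event kinds mirror the ones listed above; the `Event` fields, the `EventLog` class, and keying everything by a proposal id are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime

# The four event kinds named in the text.
KINDS = {"proposal", "approval", "override", "mrp_write"}


@dataclass(frozen=True)
class Event:
    at: datetime
    kind: str
    proposal_id: str
    actor: str   # "overlay" or an operator name (hypothetical convention)
    detail: str


class EventLog:
    """Append-only log. A real one would be durable, not a Python list."""

    def __init__(self):
        self._events = []

    def append(self, event):
        if event.kind not in KINDS:
            raise ValueError(f"unknown event kind: {event.kind}")
        self._events.append(event)

    def timeline(self, proposal_id):
        """What was proposed, who approved it and when, what hit the MRP."""
        matching = [e for e in self._events if e.proposal_id == proposal_id]
        return sorted(matching, key=lambda e: e.at)
```

With this in place, the ops manager's three questions become one call to `timeline` for the proposal behind the incident, read top to bottom.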

The shape of the bet

An MRP overlay is a smaller bet than an MRP replacement and a bigger architectural commitment than a planning spreadsheet. The three failure modes are predictable. The fixes are specific. Treat them as design constraints, not as remediation playbooks.

The objections we hear most often

Three objections come up when we walk a team through this checklist before they commit to an overlay. None of them are wrong. All of them have specific answers.

The first objection is the planner-quality argument. "The MRP's planner is the reason we wanted an overlay in the first place. Why would we let it stay in charge?" The MRP's planner is in charge of the mechanics of the planning run. The overlay is in charge of the inputs to the planner. The improvement the overlay provides shows up in better inputs (better lot-sizing rules, better safety-stock parameters, better priority weights), which produce a better planning output without requiring the overlay to also be a planner.

The second objection is the data-model argument. "We really do need to extend the MRP record because we need finance to see the new concept." This is the case to examine most carefully. If finance needs to see the new concept, then the new concept is a financial fact, and extending the MRP is the right move (with finance's buy-in on the data-model change and the integration impact). If finance does not need to see it, then the new concept is an operational fact and belongs in the overlay. The mistake is to extend the MRP for operational facts that no MRP consumer was asking for.

The third objection is the operator-confusion argument. "If the overlay exposes the underlying state as a query surface, the operator will use it instead of the proposals UI and the proposals UI will become irrelevant." This is the failure mode in reverse. If the operator is choosing the query surface over the proposals UI, the proposals UI is missing something. The signal is useful. Fix the proposals UI by widening what it surfaces. Do not remove the query surface to hide the signal.

Footnotes

[1] The 18% disagreement number is from a specific deployment we audited where the overlay and the MRP had been running in parallel for nine months. The disagreement was largest on long-lead items where the overlay's lot-sizing rule diverged most from the MRP's. The operators had developed a heuristic of using the MRP's number for any item with a lead time over six weeks, which was undocumented and discovered only during the audit.
