Master Thesis — Mariano Cánovas

The Problem

Forecasts don't reduce cost — decisions do

Intermittent demand is characterized by randomness, frequent zero observations, and irregular spikes, a pattern found in slow-moving consumer goods, spare parts, and industrial maintenance products. Poor inventory control in such contexts can create serious cost implications, since overstocking ties up capital and raises obsolescence risk, while understocking results in stockouts and lost sales. Inventory policies rely on demand estimates, so demand forecasting plays a crucial role.

Yet prior research shows that improvements in forecast accuracy do not always translate into better inventory performance, and this disconnect can also appear under intermittent demand. What matters is not the forecast itself but the ordering decision it informs. This thesis therefore evaluates Croston-based and LightGBM-based forecasts not by accuracy alone but by the holding and shortage costs of the decisions they drive within a simulation framework.

The guiding question: how do different ways of forecasting intermittent demand translate into inventory costs, and does the benefit of a more sophisticated forecast depend on the policy that acts on it?

Central contribution

This thesis separates the forecaster effect from the policy effect on the same data and under one cost objective, rather than studying forecasting and ordering apart. It also provides one of the few tests of a learned neural policy on explicitly intermittent demand, showing that its cost advantage is largest there.

The Setup

A 2 × 2 design on the VN2 Challenge

Using the VN2 Inventory Planning Challenge dataset (Vandeput, 2025), which recreates a real retail replenishment setting with weekly orders and a two-week lead time, this thesis pairs two forecasters with two ordering policies and tests all four combinations under a common cost objective. A classical Croston (SBA) forecaster and a modern LightGBM forecaster are each fed into a classical base-stock benchmark and a feed-forward neural network (FFNN) policy trained directly on simulated inventory cost via hindsight differentiable policy optimization.
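As a rough illustration of the classical forecaster named above, the sketch below implements the Syntetos-Boylan Approximation (SBA) of Croston's method in plain Python. The function name croston_sba, the smoothing constant alpha = 0.1, and the initialisation from the first non-zero demand are assumptions made for this example; the thesis may configure them differently.

```python
def croston_sba(demand, alpha=0.1):
    """Per-period demand-rate forecast using Croston's method with the SBA correction.

    Assumptions (not taken from the thesis): alpha = 0.1 by default, and the
    smoothed size and interval are initialised from the first non-zero demand.
    """
    size = None        # smoothed size of non-zero demands
    interval = None    # smoothed number of periods between non-zero demands
    periods_since = 1  # periods elapsed since the last non-zero demand

    for d in demand:
        if d > 0:
            if size is None:
                # First non-zero observation initialises both components.
                size, interval = float(d), float(periods_since)
            else:
                # Update both components only when demand occurs.
                size += alpha * (d - size)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1

    if size is None:
        return 0.0  # no demand observed at all

    # SBA multiplies Croston's ratio by (1 - alpha / 2) to reduce its bias.
    return (1 - alpha / 2) * size / interval


if __name__ == "__main__":
    history = [0, 0, 4, 0, 0, 4, 0, 0, 4]
    print(croston_sba(history))
```

The forecast is a single flat rate that is repeated for every future week, which is consistent with the later remark that a flat demand rate gives a learned policy little structure to work with.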

Evaluating both forecaster and policy on the same data isolates and quantifies the contribution of each to total inventory cost. Orders are placed in each of the first six weeks; because of the two-week lead time, the simulation runs eight weeks so orders from weeks 5 and 6 can arrive and have their costs counted. The objective is to minimise total holding plus shortage cost, where a stockout ($1.00/unit) is five times more costly than holding ($0.20/unit/week).
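The cost objective described above can be sketched as a small simulation. The cost rates, the two-week lead time, and the six-orders-over-eight-weeks layout come from the text; the order of events within a week (receive, then sell, then charge holding on what is left), the lost-sales treatment of unmet demand, and the function name simulate_cost are assumptions made for this example.

```python
HOLDING_COST = 0.20   # per unit on hand at the end of a week (from the text)
SHORTAGE_COST = 1.00  # per unit of unmet demand (from the text)
LEAD_TIME = 2         # weeks between placing and receiving an order (from the text)


def simulate_cost(orders, demand, on_hand=0.0, in_transit=(0.0, 0.0)):
    """Total holding plus shortage cost for one series.

    orders     : quantities ordered in weeks 1..6; each arrives two weeks later
    demand     : realised demand in weeks 1..8
    in_transit : stock already ordered, arriving in weeks 1 and 2
    Assumed: unmet demand is lost, and in-transit stock carries no holding cost.
    """
    if len(in_transit) != LEAD_TIME:
        raise ValueError("expected one in-transit quantity per lead-time week")
    arrivals = list(in_transit) + list(orders)
    if len(arrivals) < len(demand):
        raise ValueError("not enough arrivals to cover the simulated weeks")

    total = 0.0
    for week, d in enumerate(demand):
        on_hand += arrivals[week]   # receive whatever is due this week
        sold = min(on_hand, d)      # serve demand from stock
        shortage = d - sold         # the remainder is a lost sale
        on_hand -= sold
        total += HOLDING_COST * on_hand + SHORTAGE_COST * shortage
    return total


if __name__ == "__main__":
    # Six orders, eight weeks of demand: the orders from weeks 5 and 6
    # arrive in weeks 7 and 8, so their cost consequences are counted.
    print(simulate_cost(orders=[2] * 6, demand=[2, 0, 3, 0, 0, 5, 1, 0],
                        on_hand=1, in_transit=(1, 2)))
```

Under these rates, one unit short costs as much as holding one unit for five weeks, which is the 5 : 1 asymmetry the policies are trained and evaluated against.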

Series: 599 SKUs (store-product series drawn from 67 stores and 297 products; ~165 weeks of history)
Cost asymmetry: 5 : 1 (shortage $1.00 per unit vs holding $0.20 per unit per week)
Lead time: 2 weeks (fixed; in-transit stock incurs no holding cost)
Sparsity: 85% sparse (intermittent or lumpy; the median SKU shows zero demand in 46.5% of weeks)
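The summary does not say how series were assigned to the intermittent and lumpy classes. A common scheme for a four-class split is the Syntetos-Boylan-Croston classification, which uses the average inter-demand interval (ADI) and the squared coefficient of variation of non-zero demand sizes (CV²) with the usual cutoffs of 1.32 and 0.49. The sketch below assumes that scheme; the function name classify_demand and the use of the population variance are also assumptions.

```python
def classify_demand(series, adi_cut=1.32, cv2_cut=0.49):
    """Label a demand series as smooth, erratic, intermittent, or lumpy.

    Assumed scheme: Syntetos-Boylan-Croston cutoffs on ADI and CV^2.
    The thesis summary does not confirm that these cutoffs were used.
    """
    nonzero = [d for d in series if d > 0]
    if not nonzero:
        return "no demand"

    # ADI: average number of periods per non-zero demand occurrence.
    adi = len(series) / len(nonzero)

    # CV^2 of the non-zero demand sizes (population variance).
    mean = sum(nonzero) / len(nonzero)
    variance = sum((d - mean) ** 2 for d in nonzero) / len(nonzero)
    cv2 = variance / mean ** 2

    if adi < adi_cut:
        return "smooth" if cv2 < cv2_cut else "erratic"
    return "intermittent" if cv2 < cv2_cut else "lumpy"


if __name__ == "__main__":
    print(classify_demand([0, 0, 3, 0, 0, 3]))  # infrequent, stable sizes
    print(classify_demand([0, 0, 1, 0, 0, 9]))  # infrequent, variable sizes
```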

Key Decisions

How the four methods were built

A core data problem shapes everything. The VN2 dataset flags whether a product was in stock (the flag is FALSE in roughly 11% of observations), but it does not record how much demand was lost during those stockouts. Demand is never observed directly; only sales are. Sales and lost demand are not independent: they are two parts of the same quantity, split by the stockout. Since the lost quantity cannot be recovered, this thesis makes no attempt to reconstruct it. Instead, stockout weeks are masked from the LightGBM target, and the resulting downward bias is accepted as an inherent limitation of the data rather than corrected through ad hoc reconstruction.

The neural policy is deliberately framed to be comparable to the classical rule. The base-stock order max(0, S − IP), where S is the base-stock level and IP the inventory position, has exactly the structure of a ReLU activation. The base-stock rule is therefore a constrained special case of the neural policy within the same piecewise-linear function class, meaning the network can, in principle, replicate it exactly. Any performance difference therefore stems from the additional flexibility that the neural policy provides.
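The equivalence between the base-stock rule and a ReLU unit can be checked with a few lines of Python. The sketch assumes that the inventory position IP is on-hand stock plus the two in-transit quantities and that S is a fixed number; the function names and the example values are hypothetical.

```python
def relu(x):
    return max(0.0, x)


def base_stock_order(S, on_hand, in_transit):
    """Classical rule: order up to S from the inventory position."""
    inventory_position = on_hand + sum(in_transit)  # assumed definition of IP
    return max(0.0, S - inventory_position)


def single_relu_unit(features, weights, bias):
    """One neuron: ReLU applied to a weighted sum of the inputs plus a bias."""
    pre_activation = bias + sum(w * x for w, x in zip(weights, features))
    return relu(pre_activation)


if __name__ == "__main__":
    S = 10.0
    on_hand, in_transit = 3.0, (2.0, 1.0)

    # Weights of -1 on every stock input and a bias of S reproduce the rule.
    features = (on_hand, *in_transit)
    weights = (-1.0, -1.0, -1.0)

    print(base_stock_order(S, on_hand, in_transit))
    print(single_relu_unit(features, weights, bias=S))
```

A network with more units and with the forecasts as extra inputs contains this configuration as one point in its parameter space, which is the sense in which base-stock is a constrained special case.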

The four combinations tested

Croston (SBA) + Base-stock — the fully classical reference point
Croston (SBA) + FFNN — classical forecast, learned policy
LightGBM + Base-stock — modern forecast, classical policy
LightGBM + FFNN — modern forecast, learned policy (cheapest overall)

Result 1

LightGBM forecasts better — but only just

Measured by MASE on the test window, LightGBM wins in every cell of the table, at both horizons and in all four demand classes. Pooled over all series, it lowers MASE from 1.071 to 1.003 at t+1 and from 1.006 to 0.957 at t+2, improvements of roughly 6% and 5% respectively. The difference is largest on the lumpy and intermittent classes, which together make up 512 of the 599 series, because LightGBM can read calendar covariates and learn across series, while Croston cannot capture the year-end spike or the mild trend.

But the more telling observation is how close most values are to one, the naive benchmark, especially in the dominant intermittent and lumpy classes. Both forecasters improve only slightly on the naive benchmark, so there is little accuracy signal to pass into cost. In short, LightGBM is the more accurate forecaster here, but only by a small margin.

Forecast Accuracy (MASE, t+1) by demand class — lower is better, 1.0 = naive benchmark
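For readers unfamiliar with the metric, the sketch below computes MASE in its standard form: the mean absolute error on the test window divided by the in-sample mean absolute error of the one-step naive forecast. Whether the thesis scales both horizons by the one-step naive error is an assumption here, and the function name mase is hypothetical.

```python
def mase(train, actual, forecast):
    """Mean absolute scaled error.

    Assumed scaling: in-sample MAE of the one-step naive forecast
    (each training value predicted by the value before it).
    """
    naive_errors = [abs(b - a) for a, b in zip(train, train[1:])]
    if not naive_errors:
        raise ValueError("need at least two training observations")
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive in-sample error is zero; MASE is undefined")

    mae = sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)
    return mae / scale


if __name__ == "__main__":
    train = [0, 2, 0, 2]  # naive in-sample MAE is 2
    print(mase(train, actual=[2, 0], forecast=[1, 1]))
```

A value of 1.0 means the forecast's test error matches the naive in-sample error, so values just below one indicate only a small gain over the naive benchmark.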

Result 2

Accuracy barely carries through to cost

This is the heart of the thesis. Reading the cost table down each column holds the policy fixed and isolates the forecaster effect; reading across each row holds the forecaster fixed and isolates the policy effect. Under base-stock, swapping Croston for LightGBM lowers cost from 5,298 to 5,197, about 2%, even though Table 7 shows a 6% gain in pooled MASE. Keeping the forecaster fixed and replacing base-stock with the learned policy cuts cost from 5,197 to 4,707 on LightGBM, about 9%. On Croston the gain is small (5,298 to 5,227, about 1%), because a flat demand rate leaves the network only limited structure to work with.

Total cost (8 weeks)	Base-stock	FFNN (learned)
Croston (SBA)	5,298	5,227
LightGBM	5,197	4,707

FFNN values are the median across five random seeds (spread 62 and 78). The lowest cost, 4,707, comes from pairing the stronger forecaster with the learned policy; the distance from there to the classical Croston baseline of 5,298 owes far more to the policy than to the forecaster.
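The percentages quoted in this section follow directly from the four cells of the cost table. The short script below reproduces the arithmetic; the dictionary layout and the helper name pct_saving are illustrative choices, while the costs are the values reported above.

```python
# Total 8-week cost for each forecaster and policy pairing (from the table).
costs = {
    ("croston", "base_stock"): 5298,
    ("croston", "ffnn"): 5227,
    ("lightgbm", "base_stock"): 5197,
    ("lightgbm", "ffnn"): 4707,
}


def pct_saving(before, after):
    """Relative cost reduction in percent when moving from before to after."""
    return 100 * (before - after) / before


# Forecaster effect: change the forecaster, hold the policy fixed.
forecaster_under_base_stock = pct_saving(costs[("croston", "base_stock")],
                                         costs[("lightgbm", "base_stock")])
forecaster_under_ffnn = pct_saving(costs[("croston", "ffnn")],
                                   costs[("lightgbm", "ffnn")])

# Policy effect: change the policy, hold the forecaster fixed.
policy_on_croston = pct_saving(costs[("croston", "base_stock")],
                               costs[("croston", "ffnn")])
policy_on_lightgbm = pct_saving(costs[("lightgbm", "base_stock")],
                                costs[("lightgbm", "ffnn")])

if __name__ == "__main__":
    print(round(forecaster_under_base_stock, 1))  # 1.9
    print(round(forecaster_under_ffnn, 1))        # 9.9
    print(round(policy_on_croston, 1))            # 1.3
    print(round(policy_on_lightgbm, 1))           # 9.4
```

Each effect is small when the other factor stays classical and much larger when it is combined with the modern alternative, which is the interaction the guiding question asks about.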

The core finding

Forecast accuracy and inventory performance move together but are not the same thing. A ~6% accuracy gain became only a ~2% base-stock cost cut, while changing the ordering rule delivered ~9%. On its own, the forecaster matters less than the policy.

Result 3

The ranking is robust

The results above rest on a single 8-week horizon. Re-running every combination on six rolling test windows leaves the pattern the same: the learned policy lowers cost under both forecasters, again more so with LightGBM (6,624 to 5,735, about 13%) than with Croston (6,718 to 6,002, about 11%), and the cheapest combination is still LightGBM with the FFNN. Across five random seeds, seed variation is small (CV = 0.010 for Croston, 0.012 for LightGBM), far below the gap of more than 600 cost units between the learned policy and base-stock, so every seed outperforms the benchmark. Re-running without masking stockout weeks leaves the ranking intact as well.

Rolling windows, LightGBM: ~13% (FFNN cuts mean cost 6,624 → 5,735)
Rolling windows, Croston: ~11% (FFNN cuts mean cost 6,718 → 6,002)
Seed stability: CV ≈ 0.01 (every seed outperforms base-stock)
Why the policy wins: more state (the FFNN sees on-hand inventory, both in-transit quantities, and both forecasts separately, not one collapsed number)

Why It Wins

It's about timing, not holding less

A single illustrative series (Store 61, Product 48) makes the mechanism concrete. Both policies hold a similar amount of inventory on average, so the difference lies in timing rather than in how much inventory is carried. Base-stock lets its inventory fall to zero across Weeks 160 and 161 and loses sales, whereas the learned policy keeps stock available through those weeks. The cumulative cost gap opens with those stockouts and persists until the end of the evaluation horizon.

The learned policy wins because it has more state and more flexibility than base-stock. Base-stock is a special case it can replicate, and the network sees more: it observes on-hand inventory, both in-transit quantities, and both forecasts separately, and maps them to an order non-linearly. This is also why the policy helped more on LightGBM than on Croston: it pays off most when the forecast carries real week-to-week information.

Honest Limits

What qualifies these findings

The aim was not to win the competition but to separate the effects of the forecaster and the policy under a fair comparison. The leading entries reached about €3,763, and the competition benchmark of €4,334 was beaten by only around 15% of participants. The best result here was €4,707, which was never meant to be the lowest achievable cost.

Limitations

Censored demand: the VN2 data does not show how much demand was lost, so every forecaster was fitted on a censored series, underestimating true demand; the bias is accepted, not corrected
Deliberate simplicity: a default squared-error loss, hand-set hyperparameters, and a shallow two-layer FFNN, all kept simple for a fair comparison rather than peak accuracy
A forecasting ceiling: on the intermittent and lumpy classes both forecasters stayed close to the naive benchmark, part of why the policy effect looks larger than the forecaster effect
Generalisation: the finding may depend on the VN2 cost ratio, which heavily rewards avoiding stockouts; other ratios, lead times, or backorder rules would need further testing
Transparency: the learned policy needs a differentiable simulator and gives no readable ordering rule, so its small saving may not always justify the added complexity

In Practice

What this means for planners

For intermittent retail demand, the practical takeaway is direct: planners may gain more by improving the rule that turns forecasts into orders than by chasing forecast accuracy alone, especially when stockouts are costly. Future work points to generating synthetic demand for stockout weeks rather than masking them, repeating the comparison across other cost ratios and lead times, and removing the train-in-stages split by making the forecaster cost-aware or training the forecaster and policy together directly on cost.

Bridging Forecasting & Inventory Control

The policy matters more than the forecast
