Inventory Control and Supply Chain Optimization with Reinforcement Learning

by Dr. Phil Winder, CEO

Inventory control is the problem of optimizing product or stock levels given the unique constraints and requirements of a business. It is an important problem because every goods-based business has to spend resources on maintaining stock levels so that it can deliver the products that customers want. Every improvement to inventory control directly improves the business's ability to deliver. Beginners study tactics, experts study logistics, as the saying goes.

Reinforcement Learning (RL) is a sub-discipline of machine learning that optimizes repeated, sequential decisions for a global business-centric goal. RL is often seen in challenges such as robotics, pricing, and recommendations, but it is particularly suited to automated, optimal inventory control and supply chain management.

Common Problems in Inventory Control

Even moderately large retail businesses often stock approximately 100,000 different products that may be spread over a large geographical area, across thousands of physical stores. Warehouses or distribution hubs may act as temporary stores of products. The goal of such a business is to move goods to where they are needed most. But purchases depend on a heady mix of demand, availability, and complex global context.

A complex chain exists even for businesses that don’t directly sell to consumers; it may include shipping, distribution, vendors, and suppliers. Any of these can cause both upstream (your suppliers) and downstream (your distributors) issues that result in the loss of sales.

Complex Hierarchical Routing

Warehouses and stores are restocked by a wide range of transportation mechanisms, including road, rail, sea, and air. There are multiple trade-offs in this process that fundamentally depend on current stock levels, how quickly items perish, demand, and capacity. Supply chains attempt to manage stock levels to prevent over-stocking, which costs money and can increase waste, and to prevent under-stocking, which limits sales. A problem arises when any of these is sub-optimal.
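
To make the trade-off concrete, here is a minimal sketch of a single-product, single-period cost model; the holding, stockout, waste, and perishability numbers are purely hypothetical.

```python
# A minimal sketch of the over- vs. under-stocking trade-off for a single
# product in a single period. All cost values are hypothetical.

def period_cost(stock: int, demand: int,
                holding_cost: float = 0.5,   # cost of keeping one unsold unit
                stockout_cost: float = 2.0,  # margin lost per unit of unmet demand
                waste_cost: float = 1.0,     # extra cost when perishable units expire
                perish_fraction: float = 0.1) -> float:
    """Cost of ending a period with `stock` units on hand against `demand`."""
    unsold = max(stock - demand, 0)          # over-stocking
    unmet = max(demand - stock, 0)           # under-stocking
    wasted = perish_fraction * unsold        # perishable items that expire
    return holding_cost * unsold + stockout_cost * unmet + waste_cost * wasted

# Both over-stocking and under-stocking are penalized; the optimum sits in between.
for stock in (50, 100, 150):
    print(stock, period_cost(stock, demand=100))
```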

Uncoordinated Supply Chains

Even within the bounded context of a single organization, stock management is a difficult challenge. But many businesses also depend on the inventory of third-party companies. A lack of coordination can cause a “bullwhip” effect, which occurs when a change in retailer demand ripples back up through the supply chain.

Solving Inventory Control with Reinforcement Learning

At small scales, mixed-integer linear programming can be used to model the joint stocking requirements and constraints within a business, but this is limited to small numbers of products (often fewer than 10) and short time horizons. Practical implementations must therefore fall back on heuristics or simplifying demand assumptions.
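
As a rough illustration of what such a formulation looks like (and why it grows quickly with the number of products and periods), here is a minimal sketch using the PuLP library; the products, demands, costs, and capacity are all hypothetical.

```python
# A minimal mixed-integer programme for a two-product, three-period restocking
# plan, using the PuLP library (an assumed dependency). All numbers are hypothetical.
import pulp

products = ["A", "B"]
periods = [0, 1, 2]
demand = {("A", 0): 20, ("A", 1): 30, ("A", 2): 25,
          ("B", 0): 10, ("B", 1): 15, ("B", 2): 40}
holding_cost = {"A": 0.5, "B": 0.8}
order_cost = {"A": 2.0, "B": 3.0}
capacity = 70  # units that can be delivered per period

prob = pulp.LpProblem("restocking", pulp.LpMinimize)
order = pulp.LpVariable.dicts("order", (products, periods), lowBound=0, cat="Integer")
stock = pulp.LpVariable.dicts("stock", (products, periods), lowBound=0, cat="Integer")

# Objective: ordering cost plus holding cost over all products and periods.
prob += pulp.lpSum(order_cost[p] * order[p][t] + holding_cost[p] * stock[p][t]
                   for p in products for t in periods)

for p in products:
    for t in periods:
        prev = stock[p][t - 1] if t > 0 else 0
        # Stock balance: stock carried forward equals previous stock plus orders minus demand.
        prob += stock[p][t] == prev + order[p][t] - demand[(p, t)]
for t in periods:
    # Delivery capacity shared across products in each period.
    prob += pulp.lpSum(order[p][t] for p in products) <= capacity

prob.solve()
print(pulp.LpStatus[prob.status])
for p in products:
    print(p, [int(order[p][t].value()) for t in periods])
```

Even this toy model has one integer variable per product per period; real catalogues of thousands of products make the exact formulation intractable.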

Data-driven methods like model-predictive control and dynamic programming are one potential solution because they iterate towards optimal stock levels. However, these methods require either perfect knowledge of stock movements at all times (i.e. you would need to know when a customer was going to buy) or at least the transition probabilities of products. In most situations neither is available, and at best these methods rely on heuristics.
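
For instance, the minimal dynamic-programming (value iteration) sketch below for a single product only works because the demand distribution, and therefore the transition probabilities, are assumed to be known up front; all numbers are hypothetical.

```python
# Value iteration for a single-product inventory MDP. It only works because we
# assume the demand distribution (and so the transition probabilities) is known
# in advance. All numbers are hypothetical.
import numpy as np

max_stock = 10
demand_probs = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}          # assumed known in advance
holding_cost, stockout_cost, order_cost, gamma = 0.5, 4.0, 1.0, 0.95

def q_value(stock, order, V):
    """Expected discounted return of ordering `order` units with `stock` on hand."""
    if stock + order > max_stock:
        return -np.inf                                     # infeasible order
    value = -order_cost * order
    for demand, prob in demand_probs.items():
        next_stock = max(stock + order - demand, 0)
        unmet = max(demand - (stock + order), 0)
        reward = -holding_cost * next_stock - stockout_cost * unmet
        value += prob * (reward + gamma * V[next_stock])
    return value

V = np.zeros(max_stock + 1)
for _ in range(500):                                       # value iteration sweeps
    V = np.array([max(q_value(s, a, V) for a in range(max_stock + 1))
                  for s in range(max_stock + 1)])

# Greedy restocking policy: units to order for each on-hand stock level.
policy = [int(np.argmax([q_value(s, a, V) for a in range(max_stock + 1)]))
          for s in range(max_stock + 1)]
print(policy)
```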

Reinforcement learning (RL) is a sub-discipline of machine learning (ML) that optimizes problems involving multiple, sequential decisions to arrive at an optimal strategy, or policy. In other words, RL actively learns models that describe product movement throughout the network of retailers, warehouses, and suppliers, and derives an optimal stocking strategy given a set of constraints and goals. RL is considered superior to the previous methods because it assumes less and learns more, which makes it more generally applicable.
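
By contrast, the minimal tabular Q-learning sketch below learns a restocking policy for the same single-product problem purely from sampled interactions, without being given the transition probabilities; the simulator, costs, and hyperparameters are hypothetical.

```python
# Tabular Q-learning for a single-product restocking problem, learning purely
# from sampled interactions -- the agent is never given the transition
# probabilities. Costs, demand, and hyperparameters are hypothetical.
import random

max_stock, steps = 10, 5000
holding_cost, stockout_cost, order_cost = 0.5, 4.0, 1.0
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def step(stock, order):
    """Simulated environment: sample demand, return next stock level and reward."""
    demand = random.choices([0, 1, 2, 3], weights=[0.1, 0.3, 0.4, 0.2])[0]
    next_stock = max(stock + order - demand, 0)
    unmet = max(demand - (stock + order), 0)
    reward = -order_cost * order - holding_cost * next_stock - stockout_cost * unmet
    return next_stock, reward

Q = [[0.0] * (max_stock + 1) for _ in range(max_stock + 1)]  # Q[stock][order]

stock = 0
for _ in range(steps):
    feasible = range(max_stock + 1 - stock)                  # orders cannot exceed capacity
    if random.random() < epsilon:                            # explore
        order = random.choice(list(feasible))
    else:                                                    # exploit
        order = max(feasible, key=lambda a: Q[stock][a])
    next_stock, reward = step(stock, order)
    # Standard Q-learning update towards the bootstrapped target.
    best_next = max(Q[next_stock][a] for a in range(max_stock + 1 - next_stock))
    Q[stock][order] += alpha * (reward + gamma * best_next - Q[stock][order])
    stock = next_stock

# Learned restocking policy: units to order for each on-hand stock level.
print([max(range(max_stock + 1 - s), key=lambda a: Q[s][a]) for s in range(max_stock + 1)])
```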

Modelling Stores/Warehouses as RL Agents

One potential RL modelling approach is to treat each store and warehouse as a distinct RL agent, as in Sultana et al. Each store agent then repeatedly interacts with the warehouse to replenish stock based upon current inventory levels, forecasted sales, estimated restocking delays, and predicted wastage. The warehouse operates with similar inputs and outputs but interacts with notional suppliers. Rewards can be designed independently or jointly to maximize the desired goal, such as maintaining a constant stock level.
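
Below is a minimal sketch of what one store agent's observation, action, and reward could look like; it illustrates the modelling idea rather than Sultana et al.'s implementation, and every field name and number is hypothetical.

```python
# A minimal sketch of one store agent's interface: what it observes, what it
# decides, and how it is rewarded. Names and numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class StoreObservation:
    inventory_level: float       # units currently on the shelf
    forecast_demand: float       # predicted sales for the next period
    restock_lead_time: float     # estimated days until a warehouse order arrives
    predicted_wastage: float     # units expected to perish before being sold

def store_reward(obs: StoreObservation, units_sold: float, lost_sales: float,
                 wasted: float, target_level: float = 100.0) -> float:
    """Reward sales, penalize stock-outs, waste, and drift from a target level
    (one of many possible reward designs)."""
    return (1.0 * units_sold
            - 2.0 * lost_sales
            - 0.5 * wasted
            - 0.01 * abs(obs.inventory_level - target_level))

# The action is simply how many units the store orders from the warehouse.
obs = StoreObservation(inventory_level=80, forecast_demand=120,
                       restock_lead_time=2, predicted_wastage=5)
replenishment_order = max(obs.forecast_demand - obs.inventory_level, 0)
print(replenishment_order, store_reward(obs, units_sold=75, lost_sales=10, wasted=5))
```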

Multi-Agent, Cooperative RL Agents

One extension to the previous idea is to use multi-agent, cooperative RL, where every agent has full observability of all other agents. The benefit of this approach is that each agent can leverage the information held by the others.

For example, one problem with the previous approach is that it takes time for stores to learn optimal policies. If there is a dramatic event, like COVID-19, it can take time for stores to retrain their policies for the new environment. During this time the warehouse is learning from sub-optimal store behaviour.

A naive solution is to slow or delay the learning of the warehouse until store behaviour has stabilized. But this incurs regret (in the technical sense of the word).

An alternative approach, if feasible, is to allow the warehouse to peer inside the state of the stores. This perfect information should reduce training time.
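
Below is a minimal sketch of that idea, in which the warehouse's observation is augmented with every store's internal state rather than just the orders it receives; the field names and numbers are hypothetical.

```python
# A sketch of full observability: the warehouse agent's observation is augmented
# with every store's internal state rather than only the orders it receives.
# Field names and numbers are hypothetical.
from typing import Dict, List

StoreState = Dict[str, float]  # e.g. inventory, forecast demand, expected waste

def warehouse_observation(warehouse_inventory: float,
                          inbound_supplier_units: float,
                          store_states: List[StoreState]) -> List[float]:
    """Flatten the warehouse's own state plus every store's state into one vector."""
    obs = [warehouse_inventory, inbound_supplier_units]
    for store in store_states:
        obs.extend([store["inventory"], store["forecast_demand"], store["expected_waste"]])
    return obs

stores = [
    {"inventory": 80.0, "forecast_demand": 120.0, "expected_waste": 5.0},
    {"inventory": 40.0, "forecast_demand": 30.0, "expected_waste": 0.0},
]
print(warehouse_observation(warehouse_inventory=500, inbound_supplier_units=200,
                            store_states=stores))
```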

Multi-Agent, Cooperative, Imperfect Knowledge

In many supply chains you do not have visibility into another party’s inventory, yet the supply chain as a whole benefits from cooperation. For example, it is in a grower’s best interest to maximize the efficiency of the supply chain so that they can sell more produce. It is a boon for retailers too, because they don’t want to over- or under-stock.

In this case it is possible to treat each entity as an agent and optimize not only for local conditions (e.g. over- or under-stocking) but also for globally observable metrics like the total number of products sold, similar to what is described in Oroojlooyjadid et al.

The primary benefit of this approach is that it mitigates the “bullwhip” effect, because each level of the supply chain has a representative model of the chain above and below it.
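
As a sketch, each agent's reward could combine a local stocking penalty with a shared, globally observable term such as total units sold across the chain; this illustrates the idea rather than Oroojlooyjadid et al.'s exact formulation, and the weights are hypothetical.

```python
# A sketch of a cooperative reward: each supply-chain entity (grower, distributor,
# retailer) is rewarded for its own stocking performance plus a shared, globally
# observable term such as total units sold across the chain. Weights are hypothetical.

def agent_reward(local_holding_cost: float,
                 local_lost_sales: float,
                 chain_units_sold: float,
                 local_weight: float = 1.0,
                 global_weight: float = 0.1) -> float:
    """Combine a local penalty with a shared global incentive."""
    local_term = -(local_holding_cost + 2.0 * local_lost_sales)
    global_term = chain_units_sold
    return local_weight * local_term + global_weight * global_term

# Every agent sees the same global term, which encourages cooperation and dampens
# the bullwhip effect of reacting only to local order swings.
print(agent_reward(local_holding_cost=30.0, local_lost_sales=5.0, chain_units_sold=800.0))
```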

Industrial Applications

Global retailers like Zara have been using ML to optimize their inventory control problems for a decade. More recently, organizations like American Airlines have looked towards RL to provide solutions for optimizing seat overbooking.

RL is quickly becoming the tool of choice because it is able to solve complex inventory control problems that were previously infeasible.

Recent work to provide better simulators, combined with more efficient algorithms like those that learn offline or use guidance, means the barrier to entry is lower than ever.

How to Develop an RL-Powered Solution

You might find yourself thinking that designing, developing and deploying an RL-driven solution is daunting. True, RL solutions are difficult, relying on expertise in ML, software engineering and RL, but with help it is achievable. We wrote a book all about using reinforcement learning in business, and we’ve experienced this through our work.

In theory, an RL project is no different from any other software engineering project, and it is possible to plan and schedule the work as such. But, similar to a data science project, you need to retain some time to deal with risks and failures. Overly complex and poorly defined goals are always the first issue, but assumptions about the simplicity of the project or the availability of data can also cause major problems.

In RL, specifically, we find that more time needs to be invested in building out the simulation to derisk the operational deployment. And a lot of design time is required to plan roll-outs that don’t negatively impact the business. Because of this, RL projects tend to be larger investments over a longer period of time. This means that RL is best suited to high-impact problems.

We write about this more in our book, but the simplest and quickest solution is to give us a call.

More articles

DataTalksClub - Industrial Applications of Reinforcement Learning


GOTO Book Club: How to Leverage Reinforcement Learning
