- Executive Summary
- Introduction to RL and DeepRL
- Problem Context and Challenges
- RL-based Solution Approach
- Scenarios and Sensitivity Analysis
- Example Python Code Implementation
Executive Summary
Retailers face complex decisions in assortment planning – determining the optimal mix of products to stock across stores and channels. Traditional methods struggle to account for changing customer preferences, demand uncertainty, and the myriad factors influencing product performance. Reinforcement Learning (RL) offers a data-driven, dynamic approach to optimize assortments in real time.
By learning from continuous feedback (sales, stock levels, customer choices), RL agents can adaptively ensure “the right products are available at the right stores,” improving customer satisfaction and sales (100+ Real-Life Examples of Reinforcement Learning And Challenges). Major retailers (e.g. Tesco) have leveraged RL for assortment planning with notable success. RL’s ability to consider long-term rewards means it can balance short-term sales with strategic goals like customer loyalty and inventory efficiency. Businesses adopting RL-driven assortment optimization report benefits such as reduced stockouts and excess inventory, lower labor costs, and improved profitability (Daisy Intelligence | AI-Powered Assortment Optimization for Retail).
From a strategic perspective, RL empowers retailers to respond rapidly to market changes (trends, disruptions) and discern complex patterns in big data that may elude human planners (Unraveling Demand Forecasting Challenges in Retail | Progressive Grocer). In summary, RL-based assortment planning enhances decision-making by autonomously exploring myriad assortment scenarios and continuously refining decisions for maximum long-term ROI.
Introduction to RL and DeepRL
Reinforcement Learning is a branch of machine learning focused on sequential decision-making. An RL agent interacts with an environment by taking actions in given states and learning from the consequences via rewards. Unlike supervised learning, which learns from static labeled data, RL learns by trial-and-error to maximize cumulative rewards (100+ Real-Life Examples of Reinforcement Learning And Challenges). This makes RL well-suited for dynamic retail settings where decisions (assortment choices) have delayed and interconnected effects.
Modern RL encompasses various algorithms and techniques, often categorized as model-free (learning purely from trial-and-error) or model-based (incorporating an environment model). Below, we outline key RL and Deep Reinforcement Learning (DeepRL) techniques, highlighting their mechanics and suitability for assortment optimization:
Q-Learning
- A foundational model-free RL algorithm that learns an action-value function Q(s,a). It uses the Bellman equation to iteratively update Q-values toward optimal values. At its core, Q-learning updates the value of a state-action pair based on received reward plus the estimated value of the best future action (Q-learning – Wikipedia). By converging on Q-values, an agent can select the highest-value action in each state (greedy policy). Q-learning is off-policy (it learns the optimal policy regardless of how actions are chosen during learning, often via an \(\epsilon\)-greedy strategy for exploration). While Q-learning is powerful for smaller discrete problems, in assortment planning the state (e.g. inventory levels, time) and action (combinatorial choice of products) spaces can be enormous, making tabular Q-learning infeasible.
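For reference, the tabular Q-learning update rule, with learning rate \(\alpha\) and discount factor \(\gamma\), is:

\( Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \)

The bracketed term is the temporal-difference error between the current estimate and the bootstrapped target.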
Deep Q Network (DQN)
- DQN is a DeepRL extension of Q-learning that approximates the Q-function with a neural network (The Deep Q-Learning Algorithm – Hugging Face Deep RL Course). Instead of a table of Q-values, a deep Q-network takes a state as input and outputs Q-values for all possible actions. During training, DQN uses a loss function comparing the Q-network’s prediction to a target Q-value (reward + discounted max future Q), and uses gradient descent to minimize this loss (The Deep Q-Learning Algorithm – Hugging Face Deep RL Course). DQN introduced practical tricks to stabilize training: an experience replay buffer to break correlation in training data, and a target network to provide stable Q-value targets (The Deep Q-Learning Algorithm – Hugging Face Deep RL Course). These innovations allow DQN to learn good policies even in high-dimensional state spaces (famously achieving human-level play in Atari games). For assortment optimization, a DQN could in theory handle moderate action spaces (e.g. choosing from a small set of predefined assortment configurations) and learn from simulated sales outcomes. However, if the action involves combinatorial selection of many products, the number of possible actions may be too large for a DQN to output a Q-value for each. In such cases, policy-based methods can be more practical.
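As an illustration only (not tied to any particular assortment formulation), a minimal PyTorch sketch of the DQN target and loss computation might look like the following; q_net, target_net, and the replay-batch tensors are assumed placeholders:

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s,a) predicted by the online network for the actions actually taken
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target: r + gamma * max_a' Q_target(s', a'), zeroed out at terminal states
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * q_next
    # Minimizing this loss moves Q(s,a) toward the bootstrapped target
    return F.mse_loss(q_pred, q_target)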
Policy Gradient Methods
Instead of learning values for actions, policy gradient algorithms optimize the policy directly. They parameterize the policy \(\pi_\theta(a|s)\) (probability of taking action a in state s) with a set of parameters \(\theta\), often as a neural network, and then adjust \(\theta\) to maximize expected reward. The Policy Gradient Theorem provides a way to compute the gradient of expected return with respect to \(\theta\). In essence, policy gradient methods perform gradient ascent on the performance objective by nudging the policy to favor actions that yield higher returns. REINFORCE (Monte Carlo policy gradient) is a classic example that updates policy parameters in the direction of \(\nabla_\theta \log \pi_\theta(a|s) \cdot G\) (where \(G\) is the total return from that time). Advantages of policy gradients include the ability to handle high-dimensional or continuous action spaces and to naturally produce stochastic policies (What Is Policy Gradient In Reinforcement Learning | Restackio). This makes them attractive for assortment problems – e.g. an agent could output probabilities of including each product, allowing a mixture of assortments over time. Policy gradients can also incorporate domain-specific constraints (by shaping the policy network outputs) more flexibly than value-based methods.
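To make the REINFORCE update concrete, a minimal sketch of the loss for one sampled trajectory (names are illustrative) could be:

import torch

def reinforce_loss(log_probs, returns):
    # log_probs: list of log pi_theta(a_t|s_t) tensors collected during the episode
    # returns: discounted return G_t observed from each step onward
    log_probs = torch.stack(log_probs)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    # Gradient ascent on E[log pi * G] is implemented by minimizing the negative
    return -(log_probs * returns).mean()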
Actor-Critic Methods
These hybrid approaches combine value-based and policy-based learning. The actor is a policy that selects actions, and the critic estimates value (often the value function V(s) or action-value Q(s,a)) to critique the actor’s choices. During training, the actor’s policy parameters are updated via policy gradient, using feedback from the critic in the form of an advantage term (how much better an action was than expected). The critic is trained via a value loss (e.g. temporal difference error). Actor-critic algorithms (like A2C, DDPG, etc.) often achieve lower variance and more stable learning than pure policy gradient, while retaining the ability to handle complex action spaces (What Is Policy Gradient In Reinforcement Learning | Restackio). For assortment planning, an actor-critic approach can be very powerful: the actor (policy network) could output a recommended assortment, and the critic (value network) evaluates the long-term value of the current state under that decision. This combination can stabilize training in a high-dimensional setting (many products and states) and is well-suited to large-scale problems. In fact, a recent study on assortment optimization employed a modified Advantage Actor-Critic (A2C) algorithm coupled with a specialized assortment-generation network (Deep Reinforcement Learning for Online Assortment Customization: A Data-Driven Approach | Request PDF).
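The advantage the critic supplies can be written as the gap between an action’s value and the state’s baseline value, commonly estimated with a one-step temporal-difference error:

\( A(s,a) = Q(s,a) - V(s) \approx r + \gamma V(s') - V(s) \)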
Model-Based RL
- In model-based methods, the agent has or learns a model of the environment’s dynamics (predicting next state and reward given a state-action). This model allows the agent to plan by simulating outcomes – essentially “look ahead” to evaluate potential action sequences (Part 2: Kinds of RL Algorithms — Spinning Up documentation). For example, if a retailer had a reliable simulator of customer purchasing behavior under different assortments, a model-based RL agent could test different strategies in simulation before applying them. The big advantage of model-based RL is sample efficiency – by leveraging the model, the agent can learn optimal policies with fewer real-world trials, which is valuable when real experiments are costly. AlphaGo and AlphaZero are famous examples where planning with a learned model dramatically improved performance (Part 2: Kinds of RL Algorithms — Spinning Up documentation). However, learning an accurate model of a complex retail environment is challenging. Inaccurate models can mislead the agent into optimizing against a wrong objective (model bias), resulting in policies that perform well in simulation but poorly in reality (Part 2: Kinds of RL Algorithms — Spinning Up documentation). Due to these challenges, model-free methods (which directly learn the policy or value without an explicit model) are more common in practice, especially when a high-fidelity simulator is not available. For assortment planning, if a retailer has a known demand model (e.g. a parametric customer choice model), a model-based approach could be considered – the agent could use the model to simulate “what-if” scenarios for different assortments. But in many cases, customer behavior is too complex to perfectly model upfront, so a model-free deep RL that learns directly from data may be preferable.
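To illustrate the planning idea, here is a hedged sketch of how a model-based agent might score candidate assortments by Monte Carlo rollouts through a demand model; demand_model and its sample() interface are hypothetical stand-ins for whatever simulator or choice model the retailer trusts:

import numpy as np

def plan_assortment(candidates, demand_model, state, horizon=7, n_rollouts=100, seed=0):
    """Pick the candidate assortment with the highest simulated average revenue."""
    rng = np.random.default_rng(seed)
    best_action, best_value = None, -np.inf
    for assortment in candidates:
        total = 0.0
        for _ in range(n_rollouts):
            s = np.array(state, copy=True)
            for _ in range(horizon):
                # Hypothetical interface: the model samples one period's revenue
                # and the resulting next state under the offered assortment.
                revenue, s = demand_model.sample(s, assortment, rng)
                total += revenue
        avg = total / n_rollouts
        if avg > best_value:
            best_action, best_value = assortment, avg
    return best_action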
Choosing the Best RL Approach for Assortment Optimization
Each technique has pros and cons, and the ideal choice depends on the problem characteristics and business requirements:
- The size of action space is a key factor. If we treat an assortment as selecting a subset of products from hundreds of options, the action space is combinatorial and enormous. Value-based methods like Q-learning/DQN struggle here because enumerating all possible assortments and learning a value for each is intractable. Policy-based methods (actor-critic) can handle this by parameterizing the selection process (e.g. producing a probability for each product being included). Actor-Critic approaches are thus favored for large action spaces, as they can output structured actions and be guided by a critic for stability.
- Sample efficiency vs. model accuracy: If historical data or a simulator is available to emulate customer responses, a model-free deep RL can be trained in a reasonable timeframe. A model-based method could achieve more efficient learning by using that simulator internally, but any mismatch between the simulator and reality could degrade performance. In assortment planning, where demand models have uncertainty and external shocks occur, a model-free approach that continually learns and adapts from real data might be more robust. Recent research confirms that a model-free DRL policy can achieve near-optimal performance for dynamic assortment with unknown customer preferences (Deep Reinforcement Learning for Online Assortment Customization: A Data-Driven Approach | Request PDF), validating this choice.
- Need to handle long-term rewards and constraints: Assortment decisions impact not just immediate sales but future outcomes (e.g. selling a product now means inventory is gone for later, or introducing a new item may build future loyalty). Policy gradient and actor-critic methods inherently optimize for expected long-term return, making them suitable to capture these trade-offs. Additionally, business constraints (e.g. shelf space limits, must-have items) can be incorporated into the action logic or reward. The cited study introduced a Gated-DNN network to generate assortments that adhere to business constraints (like assortment size limits), with an actor-critic adjusting its parameters (Deep Reinforcement Learning for Online Assortment Customization: A Data-Driven Approach | Request PDF). This exemplifies how architecture can be tailored to the problem.
Considering these criteria, a Deep RL approach using an Actor-Critic algorithm emerges as an excellent choice for assortment optimization. It directly optimizes the policy (which is needed for combinatorial action spaces) and leverages a value critic for stable, efficient learning (What Is Policy Gradient In Reinforcement Learning | Restackio). This approach was successfully applied in a 2023 assortment planning study, which used a modified A2C algorithm and achieved significantly higher long-term revenue than traditional methods (Deep Reinforcement Learning for Online Assortment Customization: A Data-Driven Approach | Request PDF). In the following sections, we delve into how such an RL solution can be formulated and implemented for the assortment planning problem.
Problem Context and Challenges
Assortment planning in retail is inherently complex due to a multitude of factors and uncertainties. Before designing an RL solution, it’s important to understand the problem context and key challenges that the solution must address:
Demand Forecasting Accuracy
Predicting consumer demand for each product is notoriously difficult. Retail demand is influenced by seasonality, trends, holidays, weather, promotions, and unpredictable events. Even with modern analytics, accurate forecasts are difficult to achieve (Unraveling Demand Forecasting Challenges in Retail | Progressive Grocer) – daily sales can vary widely around the average, and unusual events (a heat wave, a viral social media trend) can render forecasts inaccurate. Traditional assortment planning often relies on forecasted sales to choose the product mix, which can lead to suboptimal outcomes if the forecasts are off. A classic challenge is the “long tail” of products: some slow-moving items may sell near zero most days but occasionally spike; deciding whether to include them is tricky when forecasts are uncertain. RL can help here by not requiring a perfect forecast upfront – the agent can learn from actual sales outcomes and adjust the assortment dynamically, effectively performing continuous demand sensing.
Cannibalization and Complementarity Effects
Products in the assortment can interact with each other in demand. Adding a new product might cannibalize sales of a similar existing product – customers substitute one for the other – resulting in no net gain. Conversely, certain products have a halo effect, where promoting or including one item boosts sales of another (e.g. burger buns and patties). These cross-product effects complicate planning. For example, if two products are close substitutes and one is heavily discounted, it will likely steal sales from the other (Cannibalization and Halo Effects in Demand Forecasts | RELEX Solutions), a classic cannibalization scenario. Ignoring these interactions (as simple models often do) leads to suboptimal assortments and missed opportunities (Cannibalization and Halo Effects in Demand Forecasts | RELEX Solutions). An RL agent, however, can learn these relationships by observing reward feedback when different combinations of products are offered. It can infer, for instance, that having both Product A and B (substitutes) in the assortment doesn’t yield as much total sales as expected, so it might choose to drop one in favor of a more complementary item. Incorporating cannibalization effects is essential for realistic assortment optimization.
Supply Chain and Logistics Disruptions
Assortment decisions are tightly coupled with supply chain reliability. A perfectly planned assortment is futile if products aren’t available on time. Modern retailers face frequent disruptions: supplier delays, stockouts at distribution centers, transportation problems, or global crises (like COVID-19) that upend supply lines. Such disruptions can force sudden assortment changes (e.g. if a certain brand is temporarily unavailable, the store must fill the shelf with alternatives). Traditional planning cycles (often seasonal or annual) react slowly to these issues. RL can introduce adaptability – an agent continuously monitoring inventory levels and lead times can preemptively adjust the assortment. For instance, if the agent’s state includes supply status, it can learn to temporarily substitute a product that has become scarce, mitigating lost sales. Additionally, RL can be used upstream: companies like Cisco use RL for supply chain risk management to anticipate and mitigate disruptions (100+ Real-Life Examples of Reinforcement Learning And Challenges). For our context, the key challenge is uncertainty in product availability, which the RL solution must be robust against (perhaps by favoring more reliable suppliers or maintaining a diversified assortment to hedge against one item’s shortage).
Price Fluctuations, Markdowns, and Promotions
Pricing strategy heavily influences assortment performance. A product might sell slowly at full price but become a top seller on discount. Promotions (sales events, coupons, buy-one-get-one) can temporarily alter demand patterns – sometimes boosting the promoted item but cannibalizing others (Cannibalization and Halo Effects in Demand Forecasts | RELEX Solutions), other times drawing additional traffic to the store (halo effect). Similarly, end-of-season markdowns or limited-time collaborations (e.g. a designer collection at a fashion retailer) inject short-lived demand surges. These fluctuations pose a challenge: the optimal assortment during a promotion week might differ from a normal week. Planners must decide which items to include knowing that relative demand will shift when prices change. RL can naturally incorporate such dynamics by treating price or promotion status as part of the state. The agent could learn different policies for high-price vs low-price periods, effectively optimizing the assortment conditioned on current pricing tactics. However, this adds complexity – the state and reward must capture revenue trade-offs, and the agent needs exposure to these scenarios in training. One must also consider that promotions and pricing could themselves be optimized (in fact, RL is also applied to dynamic pricing (100+ Real-Life Examples of Reinforcement Learning And Challenges)), which suggests a future direction of joint price-assortment optimization. In our scope, we treat pricing as an exogenous factor that makes demand volatile, and the RL agent’s challenge is to remain effective across these fluctuations (e.g. not over-commit to a product that only sells when heavily discounted).
Seasonality and Collaborations
Retail assortments often change seasonally (winter vs summer products, holiday-themed items) and include experimental items (capsule collections, brand collaborations). These factors mean the assortment optimization problem is non-stationary: the “best” set of products in summer is not the best in winter. Collaborations (say a new celebrity-endorsed product line) can cause unpredictable spikes in demand, but only for a short period. The RL agent must be able to adapt to shifting demand patterns over time – a policy learned on last quarter’s data might need recalibration when a new season or trend arrives. This can be handled by retraining the agent periodically or by online learning where the agent continues to update its knowledge. The challenge is ensuring the RL approach is flexible and doesn’t overfit to the past. Techniques like using recent data with higher weight, or having state variables that indicate season or trend status, can help the agent recognize context shifts. Essentially, the agent needs to discern when the environment has changed (e.g. holiday season starting) and adjust its decisions accordingly.
In summary, assortment planning must juggle uncertain demand, product interdependencies, and operational constraints. Success is measured by metrics like sales, profit, inventory turnover, and customer satisfaction. A robust solution must forecast implicitly or explicitly, handle substitution effects, respond to supply issues, and adjust to market dynamics. These challenges set the stage for an RL-based solution – one that can learn a policy to make near-optimal assortment decisions in the face of uncertainty and complexity that would overwhelm manual planning. The next section describes how we formulate the assortment planning problem for an RL agent and design a solution to tackle these challenges.
RL-based Solution Approach
To apply Reinforcement Learning to assortment optimization, we first need to formulate the decision problem in RL terms: define the environment in which the agent operates, what constitutes the state, what actions the agent can take, and how to measure success through a reward function. We also outline how an RL agent can be trained for this task and what model architecture is appropriate.
Formulating Assortment Planning as an MDP
We model the retail setting as a Markov Decision Process (MDP), where each time step corresponds to a decision point (e.g. a day, or an incoming customer, depending on granularity). The agent (representing the retailer’s decision-making system) observes the current state of the store and chooses an assortment (action) to offer, then receives a reward (sales/profit) and transitions to a new state (after consumer purchases and inventory updates). Over a horizon (e.g. a season or a series of customer arrivals), the agent collects rewards; the goal is to maximize the long-run cumulative reward (which correlates with total revenue or profit).
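Formally, the agent seeks a policy \(\pi\) maximizing the expected discounted return over the planning horizon \(T\):

\( \max_\pi \; \mathbb{E}_\pi \big[ \sum_{t=0}^{T} \gamma^t r_t \big] \)

where \(r_t\) is the profit collected at step \(t\) and \(\gamma \in (0,1]\) is the discount factor.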
State Space
The state should capture all relevant information needed to make an assortment decision. A comprehensive state representation may include:
- Inventory levels of each product (how many units remain in stock). This is crucial because it’s pointless to include an out-of-stock item in the assortment. The agent needs to know current stock to avoid wasted slots and to manage inventory strategically (e.g. if an item is low in stock, maybe save it for a high-value customer or promote alternatives).
- Historical sales or demand context. This could be aggregated features like sales velocity of each item in recent periods, or indicators of demand trends. It helps the agent gauge which products are hot sellers vs slow movers at the current time.
- External factors such as the current time period (season, day of week), ongoing promotions, and even local contextual info (weather, events). These factors can shift demand and thus should influence assortment decisions. For instance, state might include a flag for “holiday season” or a feature for “current discount rate” if a promotion is running.
- Customer-specific information (if decisions are made per customer interaction, like an online setting). In an online store, the state could include customer attributes or segment, so the agent can personalize the assortment shown. In a physical store context, the agent’s decisions might be periodic (e.g. set assortment for the week) rather than per customer, so customer-level data would be folded into aggregated demand predictions. For simplicity, one can think of the state in a store-based scenario as
\((I_1, I_2, \dots, I_N; t)\)
where \(I_i\) is the current inventory for product \(i\) (out of \(N\) total products under consideration) and \(t\) indicates the time period or season. This state will evolve as customers make purchases (inventory decreases) and as time moves forward.
Action Space
An action in this context is to choose an assortment of products to offer. This could mean selecting a subset of products to stock/display for the next period. In an online setting, an “assortment” might be the set of product options shown to a customer in a recommendation panel. The action space is essentially all possible subsets of products (subject to some constraints like a maximum number to carry due to shelf space). This is a huge space combinatorially (exponential in number of products), so we often impose structure:
- The retailer might limit the assortment size to \(K\) out of \(N\) products. Then an action can be represented by a binary vector of length \(N\) with \(K\) ones (included items) and \(N-K\) zeros (excluded). Another representation is an index selecting one of the \(\binom{N}{K}\) possible combinations. For example, if \(N=100\) and \(K=20\), the action is “pick these 20 products out of 100” – the short calculation after this list shows just how large that space is.
- In some formulations, the action could be more granular: rather than fully resetting the assortment, the action might be “which product to introduce next” or “which product to discontinue.” That is a sequential decision that over multiple steps leads to an assortment change. However, a one-shot selection each period is simpler for modeling.
- If pricing or ordering decisions are also included, the action could be a vector that includes those (e.g. set prices for products or order quantities), but here we focus on the selection aspect. It’s important to note that not all actions are feasible at all times. If a product is out of stock, including it in the assortment yields no benefit (unless it can be back-ordered, but assume not). The agent either needs to implicitly learn not to select unavailable items (because it gets no reward) or we restrict the action space dynamically to only in-stock items. In our implementation later, we ensure the agent is aware of inventory in state and gets zero reward for offering a stockout, which naturally discourages that action. Overall, the action is the set of product IDs to offer in the assortment for the coming interval.
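The scale of this combinatorial space is easy to check; the small snippet below (plain Python, standard library only) contrasts the toy case used later in this document with a realistic catalog size:

import math

print(math.comb(5, 3))     # 10 combinations -- small enough to enumerate (toy example below)
print(math.comb(100, 20))  # astronomically many -- enumerating every assortment is infeasible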
Reward Function
The reward must reflect the business objective. A logical reward is profit (or revenue) obtained from the sales during that time step after choosing the assortment. For example, if the agent picks an assortment and, as a result of customer arrivals, sells 50 units across those products yielding $500 profit, the reward could be 500 for that step. In a per-customer decision scenario, the reward could be the revenue from that single customer (or 0 if they purchased nothing). Key considerations for reward design:
- We may want to incorporate costs: if carrying too many products has a cost (handling, shelf space), the reward could subtract such costs to encourage an efficient assortment, not just maximal sales. For instance, a reward might be (sales revenue – holding cost of inventory – markdown costs), as sketched in the short snippet after this list.
- If we care about metrics like customer satisfaction or variety, we might include a proxy in the reward. However, most often, revenue or profit serves as a good proxy since satisfying assortment usually translates to higher sales.
- The reward is delayed and cumulative. Selling an item now might prevent selling it later due to inventory depletion. RL naturally accounts for this through future rewards: if selling out a popular item too quickly leads to zero reward later when customers can’t find it, the future reward loss will teach the agent to moderate. This long-term view is a strength of RL – it will learn to stock or allocate dynamically to maximize total-season revenue, not just immediate sales.
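As a simple illustration of such reward shaping (the cost terms and rates here are hypothetical, not prescribed by the text):

import numpy as np

def profit_reward(units_sold, prices, ending_inventory, holding_rate=0.01, markdown_cost=0.0):
    # Revenue from this period's sales minus a small charge for capital tied up in stock
    revenue = float(np.dot(units_sold, prices))
    holding = holding_rate * float(np.dot(ending_inventory, prices))
    return revenue - holding - markdown_cost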
Putting it together, at each time step the RL agent sees the state (inventory levels, time, etc.), chooses a subset of products as the assortment (action), then customers interact with that assortment. The environment (which simulates customer behavior) returns a reward equal to total sales/profit from that assortment and updates the state (inventory reduced, time advanced). The episode could last for a fixed number of periods (e.g. a season of 12 weeks, or 1000 customer arrivals). At the end, we may apply some terminal reward or simply end with the sum of per-period rewards.
Dynamic Optimization via RL
This RL formulation enables dynamic assortment optimization. Unlike a static plan that might say “carry items A, B, C for the whole season,” the RL policy can change the assortment in response to the state. For example:
- If a certain product is selling much faster than expected (inventory dropping quickly), the RL agent can observe that in state and decide to introduce a substitute product into the assortment before the popular item stocks out completely, thus maintaining sales momentum. Traditional planning might have let the popular item stock out and left an empty shelf, losing sales until replenishment.
- If demand for a category is low, the agent might replace some of those items with other categories’ products that have higher demand, effectively reallocating shelf space on the fly.
- In anticipation of a promotion (say next week there’s a planned discount on product X), the agent might include complementary items now to build awareness, or ensure enough alternatives are available if product X runs out during the promo rush.
- When a supply disruption hits (state shows no inventory coming for product Y), the agent can remove Y from the assortment and fill in with available options, preventing prolonged empty slots.
These kinds of adaptive behaviors are hard to achieve with manual rules but emerge naturally from an RL policy aimed at maximizing reward. The policy learned is essentially a mapping from state to optimal assortment choice, capturing strategies like “if inventory of A is low, include B as backup” or “if we’re late in the season with excess stock of C, push C to avoid markdown later.” This adaptability is a major value-add: the assortment is optimized continuously and conditionally, not just set once.
Training Methodology
Training an RL agent for assortment planning can be done in simulation or on historical data:
Simulation-Based Training
We create an environment simulator that mimics customer behavior and operational dynamics. This simulator could use historical demand distributions, perhaps a choice model (e.g. Multinomial Logit) calibrated on data to simulate how customers choose among offered products. The RL agent interacts with this simulator, trying different assortment strategies and learning from the rewards. Over many episodes, it converges to a good policy. This is essentially how one would train in practice: develop a “digital twin” of the retail environment. The cited research by Li et al. (2023) followed this approach, using simulated interactions with sequences of historical customer arrivals to train the agent (Deep Reinforcement Learning for Online Assortment Customization: A Data-Driven Approach | Request PDF). By replaying actual transaction sequences in simulation, the agent learns a policy that fits real-world patterns.
Offline Learning from Historical Data
In some cases, you might directly use past data of decisions and outcomes (if available) to train an agent – this veers into offline RL. For example, if a retailer has tried various assortment configurations in the past and recorded sales, an RL algorithm can try to infer an optimal policy from this batch of data. This is more challenging (off-policy learning with limited data), but emerging techniques in offline RL could be applied. However, purely offline approaches run the risk of not exploring new strategies, so simulation or live exploration is often needed to discover truly optimal policies.
Exploration vs Exploitation in Retail
A practical training consideration is that random exploration (as done in standard RL) might be costly in a real store (offering a very odd assortment could hurt sales). That’s why simulation is invaluable – the agent can explore freely in a simulated world without hurting real revenue. Once trained, the learned policy can be deployed with confidence that it’s near-optimal. Some systems might also do limited A/B tests in live stores (multi-armed bandit style experiments) to fine-tune the policy or adapt to local specifics.
Model Architecture
We choose a Deep Neural Network architecture for the agent’s decision policy (and value function, if using actor-critic). The design might look like:
State Encoding
The raw state (e.g. inventory vector) can be directly fed into a neural network. If the state includes categorical elements (like season or store type), those can be one-hot encoded or embedded. For very large numbers of products, one might use an embedding layer or a convolution over product attributes, but for moderate N, a simple feed-forward input layer works.
Policy Network
In a policy gradient or actor-critic approach, the policy network takes the state and outputs a distribution over possible actions. For assortment selection, one approach is to have the network output a score for each product (how desirable it is to include), and then a separate mechanism (like a top-K selection or a sigmoid activation for each) to construct an assortment. The Gated-DNN in the research essentially did something similar – ensuring the network’s output respects assortment size constraints (Deep Reinforcement Learning for Online Assortment Customization: A Data-Driven Approach | Request PDF). For example, the network could have \(N\) outputs (one per product) with values between 0 and 1 indicating inclusion probability; then we take the top \(K\) products by that score as the action. This is a non-differentiable selection, but techniques like perturbation or softmax approximations can allow gradients to flow. Alternatively, one can parametrize the policy to directly output a discrete action choice (an index in the list of possible combinations) – feasible if the action space has been pruned to a manageable set of candidates (e.g. a few hundred pre-optimized assortments).
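A minimal sketch of this per-product scoring idea is shown below; it is not the Gated-DNN from the cited study, just one plausible way to score products and respect a top-\(K\) size constraint (the top-K selection step is non-differentiable, as noted above):

import torch
import torch.nn as nn

class TopKAssortmentPolicy(nn.Module):
    def __init__(self, state_size, n_products, k, hidden_size=128):
        super().__init__()
        self.k = k
        self.body = nn.Sequential(nn.Linear(state_size, hidden_size), nn.ReLU())
        self.product_scores = nn.Linear(hidden_size, n_products)  # one logit per product
        self.value_head = nn.Linear(hidden_size, 1)               # critic sharing the trunk

    def forward(self, state):
        h = self.body(state)
        scores = self.product_scores(h)          # higher score = more desirable to include
        value = self.value_head(h)               # V(s) estimate for the critic
        assortment = torch.topk(scores, self.k, dim=-1).indices  # non-differentiable top-K pick
        return scores, assortment, value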
Value Network
For actor-critic, we also have a network (which can share layers with the policy network) to output V(s) – the estimated value (expected future reward) of a given state. This critic helps compute the advantage \(A(s,a)\) to reduce variance in policy updates.
Training Algorithm
We likely use a variant of policy gradient training. For example, an A2C (Advantage Actor-Critic) algorithm: the agent runs through episodes in the simulator, collects rewards, and at certain intervals (or at episode end) computes the policy gradient and value loss to update the networks. The loss function \(\mathcal{L}\) would combine a policy loss (e.g. \(-\log \pi_\theta(a|s) \cdot \hat{A}\)) encouraging actions with positive advantage, a value loss (MSE between predicted V(s) and the observed return), and possibly an entropy bonus to encourage exploring diverse assortments initially. We iterate this process for many episodes.
- Hyperparameters: We must tune learning rate, discount factor \(\gamma\) (how much future rewards are weighed vs immediate – in retail, \(\gamma\) might be close to 1 since we care about entire season outcomes), and exploration degree. In practice, careful tuning is needed – one study noted that RL can match state-of-the-art inventory policies but that hyperparameters must be tuned for each problem instance (Deep Reinforcement Learning for Online Assortment Customization: A Data-Driven Approach | Request PDF). Techniques like grid search or Bayesian optimization can be used to find good hyperparameters for the environment at hand.
- Constraints Handling: If there are hard constraints (e.g. exactly K products must be chosen, or certain must-stock items), the architecture or action logic needs to enforce them. This might mean customizing the action sampling (choose top-K outputs as mentioned, or include must-haves always). Alternatively, constraints can be encoded in the reward as severe penalties for violating, but hard constraints are usually better handled explicitly to narrow the action space the agent considers.
Summary of the Solution Approach
We create an RL agent that observes the store’s state (inventory, demand signals, etc.) and decides which products to stock or show at each decision point. The agent’s goal is to maximize cumulative profit. We train this agent in a controlled environment (simulation based on historical data or known models). Over time, the agent learns an assortment policy: a mapping from any given situation to an optimal or near-optimal assortment decision. Notably, this policy can be non-intuitive but effective – for example, it might learn to occasionally include a seemingly low-performing product because it prevents customers who can’t find their preferred item from leaving empty-handed (capturing additional revenue that a greedy strategy would miss). The deep learning component enables the agent to handle large state spaces and complex patterns (like non-linear demand interactions between products). By the end of training, we expect the agent to have “discovered” strategies that balance the assortment well, e.g. maintaining variety to avoid cannibalization, prioritizing high-movers while in stock, and exploiting substitute products when needed, all in an automated fashion.
In the next section, we will explore how the RL model behaves under various scenarios and validate its decisions through sensitivity analyses.
Scenarios and Sensitivity Analysis
A critical aspect of developing an RL solution is ensuring it performs robustly under a wide range of real-world scenarios. Retail markets are dynamic and uncertain, so we conduct extensive what-if analyses and stress tests on the trained RL model. Below we consider several scenarios and discuss how the RL-based assortment policy would handle them, as well as how we can simulate these scenarios for evaluation:
Scenario 1: Sudden Demand Surge for a Subset of Products
Imagine a scenario where, due to a viral social media trend, demand for product A and B doubles overnight, deviating from historical patterns. Traditional static plans might under-stock these items and miss out on sales. An RL agent, by design, can adapt mid-season. Upon observing unusually rapid sales (inventory depletion) of A and B, the agent’s state changes (low inventory levels, high recent sales). The learned policy likely responds by allocating more “slots” in the assortment to similar products or ensuring replacements for A and B are ready when they run out. We would test this by simulating an environment where at a certain time step, the customer choice model parameters for A and B increase. We’d compare the RL policy’s total reward to that of a fixed policy (one that doesn’t change the assortment). We expect the RL policy to quickly react – e.g. it might start featuring product C (a substitute for A) once A’s stock is nearly gone, capturing customers who still demand that category. This adaptability can be quantified by measuring the lost sales avoided by RL. In simulation, the RL agent might yield significantly higher revenue than a static assortment when a demand surge occurs, demonstrating robustness to demand volatility.
Scenario 2: Supply Disruption and Recovery
Consider that product X’s supplier faces an outage for 2 weeks. Product X normally is a strong seller, but during those 2 weeks, no new stock arrives, effectively capping what can be sold. A naive strategy might continue to allocate shelf space to X even when it’s mostly out of stock (resulting in empty shelves and lost sales). The RL agent, however, has inventory levels in state; once it sees X’s inventory hitting zero and perhaps a delay in replenishment (if the state includes incoming deliveries), it will learn to swap X out of the assortment and replace it with another product that can utilize that space. After supply recovers (stock available again), the agent can reintroduce X if it’s still optimal. To analyze this, we simulate a stock-out shock: in the environment, we freeze replenishment for X for certain steps. We’d observe the actions of the RL policy – ideally, it drops X during the shortage and brings it back later. We might also simulate not just one product but a scenario like a pandemic lockdown where supply of many items is erratic and demand shifts (e.g. higher demand for essentials, lower for luxury goods). The RL policy’s performance under such multi-faceted stress can be compared to a rule-based policy (like “always keep assortment constant”). The outcome to measure is total profit through the disruption period and recovery speed. A well-trained RL agent should prove resilient, maintaining higher sales by dynamically reallocating assortment. Indeed, research has found that such RL approaches “remain robust under various conditions” (Deep Reinforcement Learning for Online Assortment Customization: A Data-Driven Approach | Request PDF), indicating they handle scenario variations effectively.
Scenario 3: Pricing War or Major Promotion
Suppose a competitor launches a big sale, or the retailer itself runs an aggressive promotion on a subset of products. This might drastically alter the relative sales of products (price-sensitive items jump in demand, others might slow). An RL agent that was trained on typical conditions might need to cope with this temporary change. We can simulate a period where certain products’ reward contributions per sale are lower (if margins drop due to markdown) or where certain products get an external demand boost. The agent’s behavior might shift to favor higher-margin items if one category’s profitability drops. For example, if the retailer halves prices (thus margins) on electronics for a week, the RL agent might allocate slightly less shelf space to those items (because reward per sale is lower) and use some space for other categories to ensure profit optimization. Conversely, if a promotion is expected to draw in more traffic overall, the agent might stock more of everything (if inventory allows) to capitalize on the surge. Sensitivity analysis here would involve toggling the reward model (to simulate margin changes) and demand model (to simulate traffic increase) and seeing if the policy’s adjustments still make sense. We’d also evaluate the cumulative profit over the promotion period: does the RL agent capture the upside of increased volume while mitigating the downside of lower margin? If so, it should outperform a policy that just continues business-as-usual or one that over-indexes on promoted items without regard for margin.
Scenario 4: Shifting Consumer Preferences
Tastes change over time. Imagine over a few months, a new product category grows in popularity (e.g. plant-based foods) and another wanes (e.g. sugary snacks). If our RL agent is periodically retrained or continues to learn, it should gradually adjust the assortment mix to include more of the rising category and phase out the declining one. But if such a shift is gradual, we should ensure the agent isn’t stuck in a local optimum. We might simulate a trend by slowly increasing the base demand of one product and decreasing another across episodes. A robust RL policy with some exploration or regular retraining will track this trend. We measure how quickly the policy’s actions change in response. If it takes too long, that indicates a need for either more frequent policy updates or incorporating a mechanism to detect non-stationarity (like including time as state, which we have). Ideally, the agent shows graceful adaptation – smoothly changing the assortment over time, rather than abrupt jumps or failing to adapt. This scenario tests policy adaptability and can highlight if an RL policy might need occasional refreshing.
Scenario 5: Model Mis-specification and Unseen Situations
No simulation or training data can cover every possible future scenario (2020 was proof of that in retail). We consider an extreme case: an entirely new product (never seen in training) becomes available, or an external event causes a completely new pattern of demand. While RL excels at interpolation (adapting within the domain it was trained on), extrapolation is hard. To probe this, we can feed the trained agent some unusual states that were not common during training (e.g. all inventories full but zero demand for a week, or conversely, inventories near empty due to a panic buying event). We evaluate how the agent behaves – does it still follow reasonable policies or does it produce erratic actions? This kind of sensitivity test is more qualitative; it might reveal the need for safe fail-safes (like business rules to override the agent in unprecedented situations). It’s also where human expertise and RL need to complement each other – planners might step in to guide the model if something truly novel occurs. In our testing, if we find the agent flounders in a certain edge case, we can incorporate that scenario into the training process (data augmentation in simulation) or add constraints to prevent bad decisions.
To perform these analyses, we rely on simulation-based testing. For each scenario, we run many simulation episodes with the RL policy fixed (no learning, just evaluation mode) under the scenario conditions and record metrics: total revenue, service level (percent demand fulfilled), inventory leftover, etc. We often compare the RL policy to one or more baseline strategies:
- A greedy policy that always picks the current top-K selling products (myopic approach).
- A static optimal policy derived by an optimization model given perfect knowledge of the scenario’s demand (which is an upper bound of what any policy could do).
- Possibly a human-designed heuristic (e.g. “50% shelf to staples, 30% to emerging items, 20% rotating experiments”).
These comparisons show the value of RL. We expect to see that in stable scenarios, RL matches the traditional plan (it doesn’t do worse), and in volatile scenarios, RL outperforms by adapting. For instance, in simulations, the deep RL approach was shown to yield higher long-term revenue than existing methods across various tested conditions (Deep Reinforcement Learning for Online Assortment Customization: A Data-Driven Approach | Request PDF), indicating strong performance robustness. By conducting such sensitivity analysis, we build confidence that the RL solution is not a black box that only works in training conditions, but a reliable decision agent that can handle the ups and downs of the real retail environment.
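A hedged sketch of such an evaluation loop is shown below; it assumes an environment and policies with the same reset()/step() interface as the example implemented in the next section, and the greedy baseline names are purely illustrative:

import numpy as np

def evaluate_policy(env, choose_assortment, n_episodes=200):
    """Run a frozen decision rule through the scenario environment and record revenue."""
    totals = []
    for _ in range(n_episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            action = choose_assortment(state)        # returns the product indices to offer
            state, reward, done, _ = env.step(action)
            total += reward
        totals.append(total)
    return np.mean(totals), np.std(totals)

# Illustrative comparison: RL policy vs. a myopic "always offer the K best sellers" rule.
# rl_mean, _ = evaluate_policy(scenario_env, rl_policy_action)
# greedy_mean, _ = evaluate_policy(scenario_env, lambda s: tuple(best_sellers[:K]))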
Example Python Code Implementation
To demonstrate how an RL solution can be implemented, we will provide a comprehensive example using Python. We’ll create a simplified assortment planning environment and train a Deep RL agent (using PyTorch) to optimize the assortment. The example will be self-contained and documented step by step.
Note: This is a stylized example for illustration. In a production setting, the environment would be far more complex, and you would leverage distributed training, extensive simulation, and possibly frameworks like Ray RLlib or Stable Baselines. Here, we use a basic Actor-Critic approach to keep things clear.
Environment Setup
First, we define a custom environment class RetailAssortmentEnv that simulates the assortment planning problem. This environment will:
- Initialize with a certain number of products, their demand profiles, prices, and an initial inventory.
- At each step (customer arrival), allow the agent to choose an assortment (subset of products to offer).
- Simulate the customer’s purchase based on a choice model (taking into account the offered assortment and product preferences).
- Return the reward (revenue from that purchase) and update the state (inventory levels, move to next customer).
- End an episode after a fixed number of customers or when time is up.
We’ll use a simple Multinomial Logit choice model for customers: each product \(i\) has an inherent attractiveness weight \(w_i\). When presented with an assortment, the probability a customer buys product \(i\) is \(w_i\) divided by the sum of weights of all offered products (including a weight for the “no-purchase” option). This captures substitution: if a customer’s most preferred item isn’t offered, they might buy their second choice or buy nothing if nothing appealing is offered.
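Concretely, with offered assortment \(S\) and an outside (no-purchase) option of weight \(w_0\), the choice probabilities are:

\( P(\text{buy } i \mid S) = \dfrac{w_i}{w_0 + \sum_{j \in S} w_j}, \qquad P(\text{no purchase} \mid S) = \dfrac{w_0}{w_0 + \sum_{j \in S} w_j} \)

The environment code below uses \(w_0 = 1\).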
import numpy as np
class RetailAssortmentEnv:
def __init__(self, product_weights, product_prices, initial_inventory, max_assortment_size, seed=None):
"""
Initialize the environment.
:param product_weights: array of base attractiveness weights for each product.
:param product_prices: array of prices (or profit per unit) for each product.
:param initial_inventory: array of initial stock for each product.
:param max_assortment_size: the maximum number of products the agent can offer at once.
:param seed: random seed for reproducibility.
"""
self.product_weights = np.array(product_weights, dtype=float)
self.product_prices = np.array(product_prices, dtype=float)
self.initial_inventory = np.array(initial_inventory, dtype=int)
self.max_assortment_size = max_assortment_size
self.n_products = len(product_weights)
self.rng = np.random.RandomState(seed)
self.reset()
def reset(self):
"""Start a new episode: reset inventory and step counter, return initial state."""
self.inventory = self.initial_inventory.copy()
self.t = 0 # time step or customer index
# State: we'll use a concatenation of inventory levels and the normalized time.
# Normalize inventory and time for neural network stability (optional).
time_feature = [self.t / self.max_steps] if hasattr(self, 'max_steps') else [0.0]
state = np.concatenate([self.inventory / (self.initial_inventory + 1e-5), time_feature])
return state.astype(np.float32)
def step(self, action):
"""
Simulate one customer interaction given the chosen assortment action.
:param action: list or array of product indices that are offered to the customer.
:return: (next_state, reward, done, info)
"""
offered = list(action)
assert len(offered) <= self.max_assortment_size, "Offered more products than allowed."
# Remove any products with zero inventory from the offered set (can't sell what we don't have)
offered = [i for i in offered if self.inventory[i] > 0]
# If nothing offered or no stock, customer definitely buys nothing.
if len(offered) == 0:
reward = 0.0
else:
# Determine purchase probabilities for offered products:
weights = self.product_weights[offered]
# Include an outside option weight = 1.0 (tuning this can adjust how likely no-purchase is)
outside_weight = 1.0
total_weight = outside_weight + weights.sum()
# Probability of no purchase:
# p_no_purchase = outside_weight / total_weight
# Probability of each product purchase:
probs = weights / total_weight
# Choose outcome based on probabilities
choice = self.rng.choice(len(offered) + 1, p=np.append(probs, outside_weight/total_weight))
# The choice index corresponds to a product in offered list, or the last index for no-purchase.
if choice < len(offered):
chosen_product = offered[choice]
# Process purchase
self.inventory[chosen_product] -= 1 # reduce inventory
reward = self.product_prices[chosen_product]
else:
# No purchase
chosen_product = None
reward = 0.0
# Increment time step
self.t += 1
# Compute next state
time_feature = [self.t / self.max_steps] if hasattr(self, 'max_steps') else [0.0]
next_state = np.concatenate([self.inventory / (self.initial_inventory + 1e-5), time_feature]).astype(np.float32)
# Check if episode is done (for example, after fixed number of customers or no inventory left)
done = False
if hasattr(self, 'max_steps') and self.t >= self.max_steps:
done = True
if self.inventory.sum() == 0:
# End episode if all stock sold out (no more decisions can be made usefully)
done = True
info = {"chosen_product": chosen_product}
return next_state, reward, done, info
def set_max_steps(self, max_steps):
"""Optional: define a maximum number of steps (customers) per episode."""
self.max_steps = max_steps
Explanation: In this environment:
- We have an array product_weights representing each product’s baseline appeal. product_prices gives the reward per sale for each product. initial_inventory is set at the start (this could represent stock for a day or season). max_assortment_size is the constraint on how many products can be offered simultaneously.
- The state is represented as the inventory levels (normalized) plus a time feature. We include time if we set max_steps (episode length) to let the agent know how far along we are.
- In step(), given an action (assortment) as a list of product indices, we first ensure we don’t include out-of-stock items. We then simulate a single customer’s choice:
  - We compute purchase probabilities for each offered product proportional to its weight, and include an “outside option” with weight 1. This outside option accounts for the chance the customer buys nothing even if some assortment is offered (the higher we set this outside weight, the more often no purchase occurs).
  - We randomly choose an outcome according to these probabilities. If a product is chosen, we decrement its inventory and set the reward to its price. If no purchase, reward is 0.
- We increment time and return the next state (updated inventory, incremented time fraction). We signal done if we reached the max number of steps for the episode or if inventory is completely depleted (episode ends early if everything sold).
- The environment is stochastic due to the random choice, but that’s fine for training as the agent will learn expected values.
This environment is a simplified proxy for a store where each time step is one customer. In reality, one might accumulate multiple customers per assortment decision (e.g. per day a fixed assortment, then multiple customers come). But modeling per-customer keeps a constant interaction frequency and the agent implicitly learns day-level effects by seeing sequences of customers.
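As a quick sanity check of the environment’s interface, one can instantiate it with a few arbitrary products and step it once (the values here are illustrative; the configuration actually used for training appears below):

env_demo = RetailAssortmentEnv(
    product_weights=[1.5, 1.2, 1.0],
    product_prices=[100, 80, 60],
    initial_inventory=[5, 5, 5],
    max_assortment_size=2,
    seed=0,
)
env_demo.set_max_steps(10)
state = env_demo.reset()
next_state, reward, done, info = env_demo.step((0, 1))  # offer products 0 and 1
print(reward, info["chosen_product"], done)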
Now we have an environment; next we’ll implement the RL agent.
RL Agent Implementation (Actor-Critic with PyTorch)
We will use an Actor-Critic neural network to learn the assortment policy. The network will output:
- A policy: a distribution over actions (assortments). Enumerating all combinations as actions would be infeasible if there were many products. Alternative parameterizations include scoring products individually and greedily picking the top-scoring ones, or sampling each product’s inclusion independently (treating it as a multi-action). For simplicity, since this example has only a handful of products, we’ll define actions as “choose a fixed number of products = max_assortment_size” and enumerate all combinations of that size as possible actions.
- A value: estimate of expected return from the state (for the critic).
For the sake of clarity, we will do the following in code:
- Pre-compute all possible assortments of size K = max_assortment_size out of N products (this yields an action list).
- Our policy network will output a score (logit) for each possible action (assortment combination). We then apply a softmax to get probabilities and sample an action during training (or pick the max during evaluation).
- This is effectively treating each assortment combination as a discrete action choice in a traditional sense, which is fine for small N. (If N were larger, we wouldn’t enumerate but use a different policy parameterization.)
We will then train the network using policy gradient with a value baseline:
- Collect trajectories of states, actions, rewards.
- Compute returns \(G_t\) for each time step (discounted sum of future rewards).
- Compute advantages \(A_t = G_t - V(s_t)\) using the current value estimates.
- Update the policy by policy gradient: maximize \(E[\log \pi(a_t|s_t) * A_t]\) (which in code we do via gradient ascent or equivalently minimize \(-\log \pi * A\)).
- Update the value function by minimizing MSE between \(V(s_t)\) and \(G_t\) (or use \(A_t\) for TD advantage if doing multi-step).
- Use learning rate to control update sizes.
We’ll use PyTorch to build this model and perform gradient updates.
import torch
import torch.nn as nn
import torch.optim as optim
from itertools import combinations
class ActorCriticNet(nn.Module):
def __init__(self, state_size, action_count):
super(ActorCriticNet, self).__init__()
# Simple two-layer neural network for both actor and critic
hidden_size = 128
self.fc1 = nn.Linear(state_size, hidden_size)
self.relu = nn.ReLU()
# Actor output for each action (assortment combination)
self.policy_head = nn.Linear(hidden_size, action_count)
# Critic output for state value
self.value_head = nn.Linear(hidden_size, 1)
def forward(self, state):
x = self.relu(self.fc1(state))
policy_logits = self.policy_head(x) # raw scores for each action
state_value = self.value_head(x) # state value estimate
return policy_logits, state_value
# Helper: generate all combinations of a given size as possible actions
def generate_actions(num_products, assortment_size):
return list(combinations(range(num_products), assortment_size))
# Instantiate environment and model
num_products = 5
max_assortment = 3
product_weights = [1.5, 1.2, 1.0, 0.7, 0.5] # Base attractiveness for each product
product_prices = [100, 80, 60, 50, 40] # Profit per product (could correlate with weights or not)
initial_stock = [5, 5, 5, 5, 5] # Starting inventory for each product
env = RetailAssortmentEnv(product_weights, product_prices, initial_stock, max_assortment, seed=42)
env.set_max_steps(20) # limit episode to 20 customers for training
state_size = env.n_products + 1 # inventory vector length + time feature
actions_list = generate_actions(env.n_products, max_assortment)
action_count = len(actions_list)
print(f"Total possible actions (combinations of {max_assortment} out of {env.n_products}):", action_count)
model = ActorCriticNet(state_size, action_count)
optimizer = optim.Adam(model.parameters(), lr=0.01)
Explanation:
- We define ActorCriticNet with a simple architecture: one hidden layer (128 neurons, ReLU), then two heads: policy_head outputs a score for each possible action (so output dimension = number of combinations), and value_head outputs a single value (state value estimate).
- We generate all combinations of 3 out of 5 products as possible actions using itertools.combinations. For 5 choose 3, that yields 10 actions.
- We instantiate the environment with 5 products and give each a price. (For example, product 0 has weight 1.5 and price 100, while product 4 has weight 0.5 and price 40; in this setup, higher attractiveness happens to coincide with higher price.)
- We set max_steps=20, meaning each episode will simulate 20 customers (or fewer if stock runs out).
- We create the model and an Adam optimizer.
Now, we will train the agent. We’ll use a simple loop over episodes, and within each episode, loop over steps to generate an episode trajectory, then perform policy/value updates. We’ll also log some metrics like total reward per episode to monitor training.
# Training the actor-critic agent on the environment
num_episodes = 1000
gamma = 0.99 # discount factor
for episode in range(1, num_episodes+1):
state = env.reset()
state = torch.from_numpy(state).unsqueeze(0) # convert to tensor [1, state_size]
episode_rewards = []
log_probs = []
values = []
rewards = []
# Generate an episode
done = False
while not done:
# Forward pass: get policy logits and value
policy_logits, state_value = model(state)
# Sample an action according to the policy distribution (softmax)
policy_dist = torch.distributions.Categorical(logits=policy_logits)
action_index = policy_dist.sample() # sample an action index from 0...action_count-1
log_prob = policy_dist.log_prob(action_index)
# Take that action in the environment
action = actions_list[action_index.item()]
next_state, reward, done, info = env.step(action)
# Record metrics
log_probs.append(log_prob)
values.append(state_value.squeeze(0))
rewards.append(reward)
episode_rewards.append(reward)
# Move to next state
state = torch.from_numpy(next_state).unsqueeze(0)
# Episode ended, compute returns and advantages
returns = []
G = 0.0
for r in reversed(rewards):
G = r + gamma * G
returns.insert(0, G)
returns = torch.tensor(returns, dtype=torch.float32)
# Normalize returns for stability (optional)
returns = (returns - returns.mean()) / (returns.std() + 1e-8)
values_tensor = torch.stack(values)
# Compute advantage estimates
advantages = returns - values_tensor.detach().squeeze()
# Policy loss and value loss
log_probs_tensor = torch.stack(log_probs)
policy_loss = - (log_probs_tensor * advantages).mean()
value_loss = nn.functional.mse_loss(values_tensor.squeeze(), returns)
loss = policy_loss + 0.5 * value_loss # combine losses (value loss weighted 0.5)
# Backpropagate
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Occasionally, print training progress
if episode % 100 == 0:
total_reward = sum(episode_rewards)
print(f"Episode {episode}, Total Reward: {total_reward:.2f}")
Explanation:
- We loop for a number of episodes (1000). In each episode, we reset the environment to start fresh.
- We then run the policy until the episode is done. At each step:
  - We compute policy_logits and state_value from the network.
  - We create a categorical distribution from the logits (PyTorch does the softmax internally for sampling).
  - We sample an action index from this distribution (this is the exploration happening).
  - We map the index to the actual assortment (combination of product indices) using actions_list.
  - We step the environment with that assortment.
  - We collect the log_prob of the chosen action (for the policy gradient) along with the state_value and the actual reward.
- After the episode, we compute the return \(G_t\) for each time step by folding back rewards with discount gamma. This gives us a sequence of returns of the same length as rewards.
- We standardize the returns (subtract the mean, divide by the standard deviation) to help training (common in policy gradient to reduce variance).
- The advantage estimate is returns - values (we detach values when computing the advantage to avoid backpropagating through it in the policy loss).
- Policy loss is computed as \(-\mathbb{E}[\log \pi(a|s) A]\) – here we take the mean over the episode’s steps.
- Value loss is the MSE between the predicted values and the returns (which serve as the empirical targets).
- We combine them (with a 0.5 weight on value loss) and backpropagate to update the network.
- Every 100 episodes, we print the total reward of that episode as a crude progress indicator.
This training loop uses Monte Carlo returns (no bootstrapping beyond episode end). In more advanced setups, one could use TD(\(\lambda\)) or GAE (Generalized Advantage Estimation) for smoother advantages, and train on batches of episodes with parallel environments for efficiency. But our loop is straightforward and sufficient for a small example.
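For reference, a minimal sketch of GAE is shown below; it assumes lists of per-step rewards and value estimates from a finished episode and a zero bootstrap value at the terminal state:

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one completed episode."""
    advantages, gae, next_value = [], 0.0, 0.0   # terminal state is worth 0
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v       # one-step TD error
        gae = delta + gamma * lam * gae          # exponentially weighted sum of TD errors
        advantages.insert(0, gae)
        next_value = v
    return advantages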
Now, let’s run this training and see if our agent learns a sensible policy. We expect that products with high weight and high price (like product0) will be favored, but the agent should also learn to include substitutes when one runs out, etc.
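Once training finishes, one way to inspect the learned policy is to run it greedily (always picking the highest-probability assortment rather than sampling) and look at the average revenue; a short continuation of the code above:

# Evaluate the trained policy greedily (no exploration) over fresh episodes.
def run_greedy_episode(env, model, actions_list):
    state, done, total = env.reset(), False, 0.0
    while not done:
        with torch.no_grad():
            logits, _ = model(torch.from_numpy(state).unsqueeze(0))
        action = actions_list[int(torch.argmax(logits, dim=-1).item())]
        state, reward, done, _ = env.step(action)
        total += reward
    return total

eval_totals = [run_greedy_episode(env, model, actions_list) for _ in range(100)]
print("Average revenue of greedy policy over 100 evaluation episodes:", np.mean(eval_totals))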