How Winder.AI Helped Duetto Evaluate Reinforcement Learning for Hotel Pricing

by Dr. Phil Winder, CEO

Duetto is the global leader in hotel revenue strategy, powering pricing and profitability decisions for thousands of properties worldwide. Its cloud-native platform combines real‑time data, advanced forecasting, and flexible pricing automation to help hotels capture demand, outperform competitors, and adapt to rapidly shifting market conditions. As the industry evolves and pricing complexity accelerates, Duetto wanted to understand whether reinforcement learning (RL) could push beyond the limits of traditional optimisation approaches.

To find out, Duetto partnered with Winder.AI, an AI consultancy with deep expertise in reinforcement learning and rigorous AI research.

The Challenge

Duetto’s existing pricing logic relied on heuristics and optimisation techniques that had served well but were becoming harder to extend. The team identified several limitations:

  • Static decision rules struggled to adapt to shifting demand regimes, seasonal patterns, and competitive dynamics.
  • Long-term revenue optimisation was difficult to model with traditional approaches that optimise individual pricing decisions in isolation.
  • Scaling personalisation across thousands of hotels, each with different market characteristics, required an approach that could learn and generalise.
  • Data sparsity is inherent to hotel pricing: each hotel has only a limited number of historical observations for any specific combination of season, day-of-week, and booking horizon.

The question was not whether to build RL into production immediately, but whether RL was a viable path worth serious investment. Getting this wrong in either direction carried significant risk: dismissing a genuine opportunity on one side, or over-investing in an unproven approach on the other.

Business Context

Hotel revenue management is a high-stakes domain. Prices are set at the beginning of each day and affect revenue directly. Even a small improvement in pricing accuracy across thousands of properties translates to substantial revenue impact.
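To illustrate the scale, a back-of-envelope calculation shows how even a small lift compounds across a portfolio. All figures below are hypothetical, chosen only for illustration; none come from Duetto:

```python
# Back-of-envelope: revenue impact of a small pricing improvement.
# All numbers are hypothetical illustrations, not Duetto data.
properties = 4000          # hotels on the platform
rooms_per_property = 120   # average rooms per hotel
occupancy = 0.70           # average occupancy rate
adr = 150.0                # average daily rate, in dollars
lift = 0.01                # a 1% improvement in pricing accuracy

nightly_revenue = properties * rooms_per_property * occupancy * adr
annual_extra = nightly_revenue * 365 * lift
print(f"Extra annual revenue from a 1% lift: ${annual_extra:,.0f}")
```

Under these assumptions a 1% lift is worth well over a hundred million dollars a year, which is why even marginal improvements in pricing accuracy justify serious research investment.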

“We were exploring whether reinforcement learning could meaningfully improve hotel pricing decisions beyond traditional optimization approaches. Our need was to evaluate RL in a rigorous, research-driven way, understand where it adds value, and identify the technical and data requirements to make it viable at scale for thousands of hotels.”

  • Sr. Director of Data & ML, Duetto Research

Why Winder.AI

Duetto selected Winder.AI for its combination of deep RL expertise and pragmatic understanding of production constraints. The engagement required senior-level thinking, not off-the-shelf solutions. Winder.AI’s team, authors of O’Reilly’s Reinforcement Learning, brought experience from commercial RL deployments across multiple industries.

“Their depth of RL expertise combined with a pragmatic understanding of real-world constraints. They were comfortable operating in ambiguity, iterating quickly, and engaging as thought partners rather than just executing predefined tasks.”

  • Sr. Director of Data & ML, Duetto Research

Approach and Methodology

Phase 1 - Discovery and Domain Understanding

  • Workshops with Duetto’s data science and platform teams to align on research goals, metrics, and constraints
  • Formalised the pricing task as a Markov Decision Process: state (booking context and seasonality), action (price), and reward (revenue)
  • Assessed data quality and availability across candidate hotels
  • Built reusable pipelines for data retrieval, transformation, and experiment tracking
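The MDP framing above can be sketched as a simple interface. The field names here are illustrative placeholders, not Duetto's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PricingState:
    """Booking context and seasonality features (illustrative names only)."""
    days_to_arrival: int      # booking horizon for the stay date
    day_of_week: int          # 0 = Monday .. 6 = Sunday
    season: int               # e.g. quarter of the year
    rooms_remaining: int      # unsold inventory for the stay date

# Action: the nightly price to publish for a given state.
# Reward: realised revenue, e.g. price times bookings observed at that price.
def reward(price: float, bookings: int) -> float:
    return price * bookings
```

Formalising the problem this way makes the rest of the programme concrete: the data pipeline produces (state, action, reward) tuples, and every algorithm is judged on the same reward definition.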

Phase 2 - Behavioural Cloning Baseline

Before attempting reward-maximising RL, Winder.AI established a behavioural cloning (BC) baseline to verify the data pipeline and neural network architecture could learn pricing patterns from historical data. This phase uncovered critical findings around normalisation, training formulation, and network architecture that informed all subsequent experiments.
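A behavioural cloning baseline is, at its core, supervised learning: regress the logged price on the logged state. The minimal numpy sketch below uses synthetic data and a linear model in place of the neural network used in the project, purely to show the shape of the approach:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic log of historical pricing decisions: state features -> price set.
n, d = 500, 4
states = rng.normal(size=(n, d))
true_w = np.array([20.0, -5.0, 3.0, 8.0])
logged_prices = 100.0 + states @ true_w + rng.normal(scale=2.0, size=n)

# Behavioural cloning = fit a policy that imitates the logged behaviour.
X = np.hstack([states, np.ones((n, 1))])        # add a bias column
w, *_ = np.linalg.lstsq(X, logged_prices, rcond=None)

def bc_policy(state: np.ndarray) -> float:
    """Predict the price the logged policy would have set."""
    return float(np.append(state, 1.0) @ w)
```

If a baseline like this cannot reproduce historical prices, the problem lies in the data pipeline or the state representation, and no RL algorithm will fix it; that is why BC comes before reward maximisation.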

Phase 3 - Offline RL Experiments

With the baseline established, the team moved to Implicit Q-Learning (IQL), an offline RL algorithm that learns to improve upon expert behaviour without requiring a live environment. Key areas of investigation included:

  • Feature engineering: Enriched state representations with domain-specific encodings and hotel-level embeddings
  • Reward and normalisation design: Adapted reward structures and normalisation strategies for the sparse booking data
  • Hyperparameter adaptation: Published IQL defaults assume denser reward signals. The team identified the parameter regimes necessary for sparse hotel pricing data
  • Pooled training: Training across multiple hotels with embeddings outperformed single-hotel training, which is critical given the limited data per individual hotel
  • World model integration: Demand curve models were integrated to provide consistent reward signals for training and evaluation
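IQL's key ingredient is expectile regression: the value function is trained with an asymmetric squared loss so that it tracks an upper expectile of the Q-values, which lets the policy improve on the logged prices without ever querying actions outside the data. A numpy sketch of that loss (a standalone illustration, not the project's training code):

```python
import numpy as np

def expectile_loss(q_values: np.ndarray, v_values: np.ndarray,
                   tau: float = 0.7) -> float:
    """Asymmetric squared loss from IQL: errors where Q exceeds V are
    weighted by tau, the rest by 1 - tau."""
    u = q_values - v_values
    weight = np.where(u > 0, tau, 1.0 - tau)
    return float(np.mean(weight * u ** 2))
```

At tau = 0.5 this reduces to ordinary mean squared error; pushing tau towards 1 makes V chase the best actions seen in the data, which is the mechanism that allows improvement beyond pure imitation.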

Phase 4 - Evaluation Methodology

Evaluating an offline RL pricing agent is fundamentally difficult because ground truth demand at counterfactual prices is unobservable. Winder.AI developed a multi-metric evaluation approach:

  • Revenue lift estimation using demand model predictions, with careful analysis of how model bias affects per-hotel metrics
  • Logical constraint tests to verify the agent learns sensible pricing behaviour in known scenarios
  • Feature importance analysis to understand which state features drive pricing decisions
  • Bias quantification: The team discovered and quantified a strong correlation between demand model prediction error and reported revenue lift, which was critical for interpreting results honestly
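The bias check described above can be reproduced in a few lines: if reported lift correlates with the demand model's per-hotel prediction error, part of the "lift" is an evaluation artefact rather than genuine policy improvement. A sketch with synthetic numbers (illustrative only, not Duetto data):

```python
import numpy as np

rng = np.random.default_rng(1)
n_hotels = 200

# Synthetic per-hotel quantities.
model_error = rng.normal(scale=0.05, size=n_hotels)          # demand model bias
true_lift = rng.normal(loc=0.01, scale=0.02, size=n_hotels)  # genuine lift
# Reported lift is contaminated by the demand model's error.
reported_lift = true_lift + 0.8 * model_error

corr = np.corrcoef(model_error, reported_lift)[0, 1]
print(f"Correlation between model error and reported lift: {corr:.2f}")
```

A strong correlation here means per-hotel lift figures cannot be taken at face value, which is exactly why quantifying this bias was central to interpreting the results honestly.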

Collaborative Research Approach

This was treated as genuine research, not a feature delivery project. Winder.AI worked alongside Duetto’s internal data science and platform teams, aligning on metrics throughout. Failures were investigated and discussed openly. When results appeared too good, the team looked for confounding factors rather than celebrating prematurely.

Results

| Result Area | Outcome | Impact |
| --- | --- | --- |
| RL feasibility | Validated as viable for hotel pricing | Confidence to invest in RL roadmap |
| RL performance | IQL outperformed behavioural cloning baseline | Proved RL adds value beyond imitation of expert behaviour |
| Revenue lift | Positive lift signal in pooled experiments | Identified highest-value deployment targets |
| Data quality | Gaps identified and documented | Focused data engineering investment |
| Evaluation methodology | Multi-metric framework established | Reusable for future RL experiments |
| Risk reduction | Complex initiative de-risked early | Avoided premature production investment |

Strategic Impact

The most valuable outcome was clarity. Rather than committing to a full production RL system or abandoning the approach entirely, Duetto gained a precise understanding of:

  • Where RL outperforms simpler approaches and where it does not
  • That pooled training across hotel segments is essential for generalisation given the limited data per individual hotel
  • What data quality prerequisites must be met before production use
  • That evaluation methodology is as important as the RL algorithm itself
  • Which training strategies and parameter regimes are necessary for sparse hotel pricing data

This focused future investment on the areas with the highest expected return.

Customer Feedback

“The willingness to confront hard problems head-on rather than optimizing for superficial wins. The work was treated as true research, with careful debugging and transparent discussion of failures as well as successes. That rigor significantly increased our confidence in the conclusions.”

Recommendation score: 9 / 10

  • Sr. Director of Data & ML, Duetto Research

Key Takeaways

  • Rigorous offline evaluation de-risked a complex RL initiative before production commitment
  • Behavioural cloning baselines are essential for validating the data pipeline before attempting reward maximisation
  • Data sparsity is the fundamental challenge in hotel pricing RL and requires careful adaptation of standard algorithms
  • Pooled training across hotels outperforms individual hotel models when per-hotel data is scarce
  • Evaluation methodology matters as much as the algorithm: understanding the biases in your evaluation is as important as the results themselves
  • Transparent research methodology, including open discussion of failures, built genuine confidence in the results

Recommendation

“I would recommend Winder.AI to teams tackling genuinely hard ML or RL problems where correctness, evaluation rigor, and long-term impact matter more than quick demos. They are particularly strong when you need senior-level thinking and collaboration rather than off-the-shelf solutions.”

  • Sr. Director of Data & ML, Duetto Research

If your organisation is exploring reinforcement learning for pricing, operations, or decision-making at scale, Winder.AI can help you evaluate feasibility, design experiments, and focus investment where it matters most. Get in touch.
