How Winder.AI Helped Duetto Evaluate Reinforcement Learning for Hotel Pricing

by Dr. Phil Winder, CEO

Duetto is the global leader in hotel revenue strategy, powering pricing and profitability decisions for thousands of properties worldwide. Its cloud-native platform combines real‑time data, advanced forecasting, and flexible pricing automation to help hotels capture demand, outperform competitors, and adapt to rapidly shifting market conditions. As the industry evolves and pricing complexity accelerates, Duetto wanted to understand whether reinforcement learning (RL) could push beyond the limits of traditional optimisation approaches.

To find out, Duetto partnered with Winder.AI, an AI consultancy with deep expertise in reinforcement learning and rigorous AI research.

The Challenge

Duetto’s existing pricing logic relied on heuristics and optimisation techniques that had served well but were becoming harder to extend. The team identified several limitations:

  • Static decision rules struggled to adapt to shifting demand regimes, seasonal patterns, and competitive dynamics.
  • Long-term revenue optimisation was difficult to model with traditional approaches that optimise individual pricing decisions in isolation.
  • Scaling personalisation across thousands of hotels, each with different market characteristics, required an approach that could learn and generalise.
  • Data sparsity is inherent to hotel pricing: each hotel has only a limited number of historical observations for any specific combination of season, day-of-week, and booking horizon.

The question was not whether to build RL into production immediately, but whether RL was a viable path worth serious investment. Getting this wrong in either direction carried significant risk: dismissing a genuine opportunity on one side, or over-investing in an unproven approach on the other.

Business Context

Hotel revenue management is a high-stakes domain. Prices are set at the beginning of each day and affect revenue directly. Even a small improvement in pricing accuracy across thousands of properties translates to substantial revenue impact.
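To illustrate the scale, a back-of-envelope calculation shows how even a small lift compounds across a portfolio. All figures below are hypothetical, chosen only for illustration; none come from Duetto:

```python
# Back-of-envelope: revenue impact of a small pricing improvement.
# All numbers are hypothetical illustrations, not Duetto data.
properties = 4000          # hotels on the platform
rooms_per_property = 120   # average rooms per hotel
occupancy = 0.70           # average occupancy rate
adr = 150.0                # average daily rate, in dollars
lift = 0.01                # a 1% improvement in pricing accuracy

nightly_revenue = properties * rooms_per_property * occupancy * adr
annual_extra = nightly_revenue * 365 * lift
print(f"Extra annual revenue from a 1% lift: ${annual_extra:,.0f}")
```

Under these assumptions a 1% lift is worth well over a hundred million dollars a year, which is why even marginal improvements in pricing accuracy justify serious research investment.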

“We were exploring whether reinforcement learning could meaningfully improve hotel pricing decisions beyond traditional optimization approaches. Our need was to evaluate RL in a rigorous, research-driven way, understand where it adds value, and identify the technical and data requirements to make it viable at scale for thousands of hotels.”

  • Sr. Director of Data & ML, Duetto Research

Why Winder.AI

Duetto selected Winder.AI for its combination of deep RL expertise and pragmatic understanding of production constraints. The engagement required senior-level thinking, not off-the-shelf solutions. Winder.AI’s team, authors of O’Reilly’s Reinforcement Learning, brought experience from commercial RL deployments across multiple industries.

“Their depth of RL expertise combined with a pragmatic understanding of real-world constraints. They were comfortable operating in ambiguity, iterating quickly, and engaging as thought partners rather than just executing predefined tasks.”

  • Sr. Director of Data & ML, Duetto Research

Approach and Methodology

Phase 1 - Discovery and Domain Understanding

  • Workshops with Duetto’s data science and platform teams to align on research goals, metrics, and constraints
  • Formalised the pricing task as a Markov Decision Process: state (booking context and seasonality), action (price), and reward (revenue)
  • Assessed data quality and availability across candidate hotels
  • Built reusable pipelines for data retrieval, transformation, and experiment tracking
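The MDP framing above can be sketched as a simple interface. The field names here are illustrative placeholders, not Duetto's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PricingState:
    """Booking context and seasonality features (illustrative names only)."""
    days_to_arrival: int      # booking horizon for the stay date
    day_of_week: int          # 0 = Monday .. 6 = Sunday
    season: int               # e.g. quarter of the year
    rooms_remaining: int      # unsold inventory for the stay date

# Action: the nightly price to publish for a given state.
# Reward: realised revenue, e.g. price times bookings observed at that price.
def reward(price: float, bookings: int) -> float:
    return price * bookings
```

Formalising the problem this way makes the rest of the programme concrete: the data pipeline produces (state, action, reward) tuples, and every algorithm is judged on the same reward definition.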

Phase 2 - Behavioural Cloning Baseline

Before attempting reward-maximising RL, Winder.AI established a behavioural cloning (BC) baseline to verify the data pipeline and neural network architecture could learn pricing patterns from historical data. This phase uncovered critical findings around normalisation, training formulation, and network architecture that informed all subsequent experiments.
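A behavioural cloning baseline is, at its core, supervised learning: regress the logged price on the logged state. The minimal numpy sketch below uses synthetic data and a linear model in place of the neural network used in the project, purely to show the shape of the approach:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic log of historical pricing decisions: state features -> price set.
n, d = 500, 4
states = rng.normal(size=(n, d))
true_w = np.array([20.0, -5.0, 3.0, 8.0])
logged_prices = 100.0 + states @ true_w + rng.normal(scale=2.0, size=n)

# Behavioural cloning = fit a policy that imitates the logged behaviour.
X = np.hstack([states, np.ones((n, 1))])        # add a bias column
w, *_ = np.linalg.lstsq(X, logged_prices, rcond=None)

def bc_policy(state: np.ndarray) -> float:
    """Predict the price the logged policy would have set."""
    return float(np.append(state, 1.0) @ w)
```

If a baseline like this cannot reproduce historical prices, the problem lies in the data pipeline or the state representation, and no RL algorithm will fix it; that is why BC comes before reward maximisation.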

Phase 3 - Offline RL Experiments

With the baseline established, the team moved to Implicit Q-Learning (IQL), an offline RL algorithm that learns to improve upon expert behaviour without requiring a live environment. Key areas of investigation included:

  • Feature engineering: Enriched state representations with domain-specific encodings and hotel-level embeddings
  • Reward and normalisation design: Adapted reward structures and normalisation strategies for the sparse booking data
  • Hyperparameter adaptation: Published IQL defaults assume denser reward signals. The team identified the parameter regimes necessary for sparse hotel pricing data
  • Pooled training: Training across multiple hotels with embeddings outperformed single-hotel training, which is critical given the limited data per individual hotel
  • World model integration: Demand curve models were integrated to provide consistent reward signals for training and evaluation
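IQL's key ingredient is expectile regression: the value function is trained with an asymmetric squared loss so that it tracks an upper expectile of the Q-values, which lets the policy improve on the logged prices without ever querying actions outside the data. A numpy sketch of that loss (a standalone illustration, not the project's training code):

```python
import numpy as np

def expectile_loss(q_values: np.ndarray, v_values: np.ndarray,
                   tau: float = 0.7) -> float:
    """Asymmetric squared loss from IQL: errors where Q exceeds V are
    weighted by tau, the rest by 1 - tau."""
    u = q_values - v_values
    weight = np.where(u > 0, tau, 1.0 - tau)
    return float(np.mean(weight * u ** 2))
```

At tau = 0.5 this reduces to ordinary mean squared error; pushing tau towards 1 makes V chase the best actions seen in the data, which is the mechanism that allows improvement beyond pure imitation.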

Phase 4 - Evaluation Methodology

Evaluating an offline RL pricing agent is fundamentally difficult because ground truth demand at counterfactual prices is unobservable. Winder.AI developed a multi-metric evaluation approach:

  • Revenue lift estimation using demand model predictions, with careful analysis of how model bias affects per-hotel metrics
  • Logical constraint tests to verify the agent learns sensible pricing behaviour in known scenarios
  • Feature importance analysis to understand which state features drive pricing decisions
  • Bias quantification: The team discovered and quantified a strong correlation between demand model prediction error and reported revenue lift, which was critical for interpreting results honestly
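The bias check described above can be reproduced in a few lines: if reported lift correlates with the demand model's per-hotel prediction error, part of the "lift" is an evaluation artefact rather than genuine policy improvement. A sketch with synthetic numbers (illustrative only, not Duetto data):

```python
import numpy as np

rng = np.random.default_rng(1)
n_hotels = 200

# Synthetic per-hotel quantities.
model_error = rng.normal(scale=0.05, size=n_hotels)          # demand model bias
true_lift = rng.normal(loc=0.01, scale=0.02, size=n_hotels)  # genuine lift
# Reported lift is contaminated by the demand model's error.
reported_lift = true_lift + 0.8 * model_error

corr = np.corrcoef(model_error, reported_lift)[0, 1]
print(f"Correlation between model error and reported lift: {corr:.2f}")
```

A strong correlation here means per-hotel lift figures cannot be taken at face value, which is exactly why quantifying this bias was central to interpreting the results honestly.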

Collaborative Research Approach

This was treated as genuine research, not a feature delivery project. Winder.AI worked alongside Duetto’s internal data science and platform teams, aligning on metrics throughout. Failures were investigated and discussed openly. When results appeared too good, the team looked for confounding factors rather than celebrating prematurely.

Results

| Result Area | Outcome | Impact |
| --- | --- | --- |
| RL feasibility | Validated as viable for hotel pricing | Confidence to invest in RL roadmap |
| RL performance | IQL outperformed behavioural cloning baseline | Proved RL adds value beyond imitation of expert behaviour |
| Revenue lift | Positive lift signal in pooled experiments | Identified highest-value deployment targets |
| Data quality | Gaps identified and documented | Focused data engineering investment |
| Evaluation methodology | Multi-metric framework established | Reusable for future RL experiments |
| Risk reduction | Complex initiative de-risked early | Avoided premature production investment |

Strategic Impact

The most valuable outcome was clarity. Rather than committing to a full production RL system or abandoning the approach entirely, Duetto gained a precise understanding of:

  • Where RL outperforms simpler approaches and where it does not
  • That pooled training across hotel segments is essential for generalisation given the limited data per individual hotel
  • What data quality prerequisites must be met before production use
  • That evaluation methodology is as important as the RL algorithm itself
  • Which training strategies and parameter regimes are necessary for sparse hotel pricing data

This focused future investment on the areas with the highest expected return.

Customer Feedback

“The willingness to confront hard problems head-on rather than optimizing for superficial wins. The work was treated as true research, with careful debugging and transparent discussion of failures as well as successes. That rigor significantly increased our confidence in the conclusions.”

Recommendation score: 9 / 10

  • Sr. Director of Data & ML, Duetto Research

Key Takeaways

  • Rigorous offline evaluation de-risked a complex RL initiative before production commitment
  • Behavioural cloning baselines are essential for validating the data pipeline before attempting reward maximisation
  • Data sparsity is the fundamental challenge in hotel pricing RL and requires careful adaptation of standard algorithms
  • Pooled training across hotels outperforms individual hotel models when per-hotel data is scarce
  • Evaluation methodology matters as much as the algorithm: understanding the biases in your evaluation is as important as the results themselves
  • Transparent research methodology, including open discussion of failures, built genuine confidence in the results

Recommendation

“I would recommend Winder.AI to teams tackling genuinely hard ML or RL problems where correctness, evaluation rigor, and long-term impact matter more than quick demos. They are particularly strong when you need senior-level thinking and collaboration rather than off-the-shelf solutions.”

  • Sr. Director of Data & ML, Duetto Research

If your organisation is exploring reinforcement learning for pricing, operations, or decision-making at scale, Winder.AI can help you evaluate feasibility, design experiments, and focus investment where it matters most. Get in touch.
