How Winder.AI Helped Duetto Evaluate Reinforcement Learning for Hotel Pricing
by Dr. Phil Winder, CEO
Duetto is the global leader in hotel revenue strategy, powering pricing and profitability decisions for thousands of properties worldwide. Its cloud-native platform combines real‑time data, advanced forecasting, and flexible pricing automation to help hotels capture demand, outperform competitors, and adapt to rapidly shifting market conditions. As the industry evolves and pricing complexity accelerates, Duetto wanted to understand whether reinforcement learning (RL) could push beyond the limits of traditional optimisation approaches.
To find out, Duetto partnered with Winder.AI, an AI consultancy with deep expertise in reinforcement learning and rigorous AI research.
The Challenge
Duetto’s existing pricing logic relied on heuristics and optimisation techniques that had served well but were becoming harder to extend. The team identified several limitations:
- Static decision rules struggled to adapt to shifting demand regimes, seasonal patterns, and competitive dynamics.
- Long-term revenue optimisation was difficult to model with traditional approaches that optimise individual pricing decisions in isolation.
- Scaling personalisation across thousands of hotels, each with different market characteristics, required an approach that could learn and generalise.
- Booking data is inherently sparse: each hotel has only limited historical observations for any specific combination of season, day-of-week, and booking horizon.
The question was not whether to build RL into production immediately, but whether RL was a viable path worth serious investment. Getting this wrong in either direction, whether by dismissing a genuine opportunity or by over-investing in an unproven approach, carried significant risk.
Business Context
Hotel revenue management is a high-stakes domain. Prices are set at the beginning of each day and affect revenue directly. Even a small improvement in pricing accuracy across thousands of properties translates to substantial revenue impact.
“We were exploring whether reinforcement learning could meaningfully improve hotel pricing decisions beyond traditional optimization approaches. Our need was to evaluate RL in a rigorous, research-driven way, understand where it adds value, and identify the technical and data requirements to make it viable at scale for thousands of hotels.”
- Sr. Director of Data & ML, Duetto Research
Why Winder.AI
Duetto selected Winder.AI for its combination of deep RL expertise and pragmatic understanding of production constraints. The engagement required senior-level thinking, not off-the-shelf solutions. Winder.AI’s team, authors of O’Reilly’s Reinforcement Learning, brought experience from commercial RL deployments across multiple industries.
“Their depth of RL expertise combined with a pragmatic understanding of real-world constraints. They were comfortable operating in ambiguity, iterating quickly, and engaging as thought partners rather than just executing predefined tasks.”
- Sr. Director of Data & ML, Duetto Research
Approach and Methodology
Phase 1 - Discovery and Domain Understanding
- Workshops with Duetto’s data science and platform teams to align on research goals, metrics, and constraints
- Formalised the pricing task as a Markov Decision Process, sketched after this list: state (booking context and seasonality), action (price), and reward (revenue)
- Assessed data quality and availability across candidate hotels
- Built reusable pipelines for data retrieval, transformation, and experiment tracking
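To make this framing concrete, here is a minimal sketch of how a single daily pricing decision could be expressed as an MDP. The class and field names (such as `days_to_arrival` and `rooms_remaining`) are illustrative assumptions, not Duetto’s actual schema.

```python
from dataclasses import dataclass

# Illustrative MDP framing of the daily pricing decision.
# Field names are hypothetical and do not reflect Duetto's real schema.

@dataclass(frozen=True)
class PricingState:
    hotel_id: str          # property being priced
    stay_dow: int          # day of week of the stay date (0-6)
    days_to_arrival: int   # remaining booking horizon
    season_index: float    # encoded seasonality signal
    rooms_remaining: int   # unsold inventory for the stay date

def reward(price: float, bookings: int) -> float:
    """Revenue realised from one pricing decision."""
    return price * bookings

# Action: the nightly price chosen at the start of the day.
# A trajectory is the sequence of (state, price, revenue) tuples
# observed as the booking horizon counts down to arrival.
```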
Phase 2 - Behavioural Cloning Baseline
Before attempting reward-maximising RL, Winder.AI established a behavioural cloning (BC) baseline to verify that the data pipeline and neural network architecture could learn pricing patterns from historical data. This phase uncovered critical findings around normalisation, training formulation, and network architecture that informed all subsequent experiments.
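As a rough illustration of what such a baseline involves, the sketch below regresses historical prices from state features with a small feed-forward network. The framework (PyTorch), layer sizes, and feature dimension are assumptions made for clarity, not the architecture used in the project.

```python
import torch
import torch.nn as nn

# Behavioural cloning: learn to reproduce historical prices from state features.
# The feature dimension (16) and layer sizes are illustrative only.

class PricePolicy(nn.Module):
    def __init__(self, n_features: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),  # predicted (normalised) price
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states)

def bc_loss(policy: PricePolicy, states: torch.Tensor, prices: torch.Tensor) -> torch.Tensor:
    """Mean squared error between predicted and historical prices."""
    return nn.functional.mse_loss(policy(states).squeeze(-1), prices)
```

A baseline like this also surfaces practical issues, such as whether price normalisation and feature scaling behave consistently, before any reward-maximising training begins.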
Phase 3 - Offline RL Experiments
With the baseline established, the team moved to Implicit Q-Learning (IQL), an offline RL algorithm that learns to improve upon expert behaviour without requiring a live environment. Key areas of investigation included the following; a simplified sketch of the pooled-training idea follows the list:
- Feature engineering: Enriched state representations with domain-specific encodings and hotel-level embeddings
- Reward and normalisation design: Adapted reward structures and normalisation strategies for the sparse booking data
- Hyperparameter adaptation: Published IQL defaults assume denser reward signals. The team identified the parameter regimes necessary for sparse hotel pricing data
- Pooled training: Training across multiple hotels with embeddings outperformed single-hotel training, which is critical given the limited data per individual hotel
- World model integration: Demand curve models were integrated to provide consistent reward signals for training and evaluation
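The sketch below illustrates the pooled-training idea: a single Q-network with a learned hotel embedding, so structure is shared across properties while each hotel keeps its own representation. The dimensions and layer sizes are illustrative, and the remaining IQL machinery (an expectile-regressed value function and advantage-weighted policy extraction) is omitted for brevity.

```python
import torch
import torch.nn as nn

# Pooled training across hotels: one Q-network shared by all properties,
# with a learned embedding per hotel. Sizes below are illustrative only.

class PooledQNetwork(nn.Module):
    def __init__(self, n_hotels: int, n_features: int = 16, embed_dim: int = 8):
        super().__init__()
        self.hotel_embedding = nn.Embedding(n_hotels, embed_dim)
        self.q_net = nn.Sequential(
            nn.Linear(n_features + embed_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),  # estimated Q(state, price)
        )

    def forward(self, hotel_ids: torch.Tensor, states: torch.Tensor,
                prices: torch.Tensor) -> torch.Tensor:
        emb = self.hotel_embedding(hotel_ids)               # (batch, embed_dim)
        x = torch.cat([states, emb, prices.unsqueeze(-1)], dim=-1)
        return self.q_net(x).squeeze(-1)
```

In IQL, a network of this shape is trained entirely from logged data: a value network is fitted by expectile regression, the Q-network regresses towards reward plus the discounted value of the next state, and the policy is extracted with advantage weighting, which keeps it close to the behaviour observed in the dataset.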
Phase 4 - Evaluation Methodology
Evaluating an offline RL pricing agent is fundamentally difficult because ground-truth demand at counterfactual prices is unobservable. Winder.AI developed a multi-metric evaluation approach, illustrated in a sketch after this list:
- Revenue lift estimation using demand model predictions, with careful analysis of how model bias affects per-hotel metrics
- Logical constraint tests to verify the agent learns sensible pricing behaviour in known scenarios
- Feature importance analysis to understand which state features drive pricing decisions
- Bias quantification: The team discovered and quantified a strong correlation between demand model prediction error and reported revenue lift, which was critical for interpreting results honestly
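To illustrate the bias check, the sketch below estimates revenue lift under a demand model and then measures how strongly that lift correlates with the model’s own prediction error across hotels. The helper names and inputs are hypothetical and are not the project’s evaluation code.

```python
import numpy as np

# Illustrative evaluation helpers. A strong correlation between reported lift
# and demand-model error suggests the lift partly reflects model bias rather
# than genuine pricing improvement.

def estimated_lift(demand_model, baseline_prices: np.ndarray,
                   agent_prices: np.ndarray) -> np.ndarray:
    """Relative revenue lift implied by the demand model at agent vs. baseline prices."""
    base_revenue = baseline_prices * demand_model(baseline_prices)
    agent_revenue = agent_prices * demand_model(agent_prices)
    return (agent_revenue - base_revenue) / base_revenue

def lift_bias_correlation(lift_per_hotel: np.ndarray,
                          demand_error_per_hotel: np.ndarray) -> float:
    """Pearson correlation between per-hotel lift and demand-model error."""
    return float(np.corrcoef(lift_per_hotel, demand_error_per_hotel)[0, 1])
```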
Collaborative Research Approach
This was treated as genuine research, not a feature delivery project. Winder.AI worked alongside Duetto’s internal data science and platform teams, aligning on metrics throughout. Failures were investigated and discussed openly, and when results looked too good to be true the team searched for confounding factors rather than celebrating prematurely.
Results
| Result Area | Outcome | Impact |
|---|---|---|
| RL Feasibility | Validated as viable for hotel pricing | Confidence to invest in RL roadmap |
| RL Performance | IQL outperformed behavioural cloning baseline | Proved RL adds value beyond imitation of expert behaviour |
| Revenue Lift | Positive lift signal in pooled experiments | Identified highest-value deployment targets |
| Data Quality | Gaps identified and documented | Focused data engineering investment |
| Evaluation Methodology | Multi-metric framework established | Reusable for future RL experiments |
| Risk Reduction | Complex initiative de-risked early | Avoided premature production investment |
Strategic Impact
The most valuable outcome was clarity. Rather than committing to a full production RL system or abandoning the approach entirely, Duetto gained a precise understanding of:
- Where RL outperforms simpler approaches and where it does not
- Why pooled training across hotel segments is essential for generalisation, given the limited data per individual hotel
- What data quality prerequisites must be met before production use
- Why evaluation methodology is as important as the RL algorithm itself
- Which training strategies and parameter regimes are necessary for sparse hotel pricing data
This focused future investment on the areas with the highest expected return.
Customer Feedback
“The willingness to confront hard problems head-on rather than optimizing for superficial wins. The work was treated as true research, with careful debugging and transparent discussion of failures as well as successes. That rigor significantly increased our confidence in the conclusions.”
Recommendation Score: 9/10
- Sr. Director of Data & ML, Duetto Research
Key Takeaways
- Rigorous offline evaluation de-risked a complex RL initiative before production commitment
- Behavioural cloning baselines are essential for validating the data pipeline before attempting reward maximisation
- Data sparsity is the fundamental challenge in hotel pricing RL and requires careful adaptation of standard algorithms
- Pooled training across hotels outperforms individual hotel models when per-hotel data is scarce
- Evaluation methodology matters as much as the algorithm: understanding the biases in your evaluation is as important as the results themselves
- Transparent research methodology, including open discussion of failures, built genuine confidence in the results
Why Winder.AI
“I would recommend Winder.AI to teams tackling genuinely hard ML or RL problems where correctness, evaluation rigor, and long-term impact matter more than quick demos. They are particularly strong when you need senior-level thinking and collaboration rather than off-the-shelf solutions.”
- Sr. Director of Data & ML, Duetto Research
If your organisation is exploring reinforcement learning for pricing, operations, or decision-making at scale, Winder.AI can help you evaluate feasibility, design experiments, and focus investment where it matters most. Get in touch.