Deep Reinforcement Learning Workshop - Hands-on with Deep RL

by Dr. Phil Winder, CEO

This is a video of a workshop about deep reinforcement learning (DRL). First presented at ODSC London in 2023, it is nearly three hours long and covers a wide variety of topics. Split into three sections, the video introduces DRL and RL applications, explains how to develop an RL project, and walks you through two RL example notebooks.


Phil Winder shares Winder.AI’s reinforcement learning experience at a variety of large and small organizations.

This tutorial is all about deep reinforcement learning. You might have heard about it in the media, from its use in generative language models (reinforcement learning from human feedback) or more directly in one of the many applications of this fascinating technology. The goal of this tutorial is to give you a hands-on-ish walkthrough of what reinforcement learning is, why we need it to be deep, and how it’s used in practice. You will learn the background theory, explore use cases, and have fun with a notebook that provides a practical example of what we’re talking about. There will also be an opportunity for you to ask questions and find out more about how we at Winder.AI are using RL in commercial projects.

Session Outline

Part 1: Introduction to Reinforcement Learning

In this section, Dr. Winder introduces reinforcement learning (RL) and its applications. RL is a class of algorithms that allows software agents to make sequential decisions and learn from their interactions with an external environment. The workshop highlights RL’s potential in automating strategic decision-making, addressing problems in robotics, industrial automation, and more. Key characteristics of a good RL problem, such as exploration, clear rewards, rapid feedback, and strategic decision-making, are discussed.

Part 2: Developing RL Projects

This part delves into the basics of DRL and its applications in solving Markov Decision Processes (MDPs). Dr. Winder explains the entities involved in RL problems, including the agent, the environment, and observations. The importance of policy engineering, state representation, and generalization is highlighted. Deep learning’s role in DRL and recommended RL frameworks like RLlib and stable-baselines3 are also discussed.

Part 3: Practical Examples

Two practical examples of DRL are demonstrated in this section. The first example involves using RL to alter the output of a language model to produce positive and nicer tweets. The second example focuses on the Monte Carlo algorithm and its use in solving RL problems in Grid World environments. The workshop emphasizes visualization for understanding and debugging RL algorithms.

Frequently Asked Questions (FAQ)

The web page includes an FAQ section covering various questions related to RL. These include topics like RL’s market size, its application to business strategy, its maturity compared to traditional machine learning, its effectiveness in different domains, and its resource and time requirements. It also answers questions about using RL to find bugs in games, specific applications of RL in gaming, and how RL can optimize user enjoyment in games. The flexibility of RL for different applications and learning resources for exploring RL further are also addressed in the FAQ section.

Learning objectives:

  • Understand what RL is and how it differs from ML
  • Appreciate why and when you should use RL
  • Evaluate the need for deep techniques
  • Explore the ecosystem of tools and a simple practical example of how to use them


The following transcript was generated with the help of AWS Transcribe and ChatGPT. Apologies for any errors.

Dr. Phil Winder, CEO of Winder.AI, provides an overview of a presentation he delivered at ODSC in London, which was not recorded. He has decided to re-present it online, split into three parts due to its length. The sessions are available for interaction and viewers can post questions through the chat features.

The primary focus of the presentation is to provide an understanding of Reinforcement Learning (RL). Dr. Winder draws parallels to how humans and animals learn, highlighting the importance of both curriculum-based learning and experiential learning. RL, much like trial-and-error learning in humans, helps to build predictive models of which actions are favorable. Examples include teaching pigeons and chickens through reinforcement, humans learning to swim, and the challenge of relearning to ride a bike with inverted controls.

Part 1: Introduction

In a business context, RL can be considered as a tool for automating strategies, the highest rung of business operations. The three layers of business operations are processes (lowest rung), decisions (middle rung), and strategies (top rung). While software can automate basic processes and machine learning can assist in decision-making, RL shines in the realm of strategy automation, which holds the highest value and impact for businesses. However, the market for RL is smaller than software because it addresses specific high-value problems.

Reinforcement learning (RL) is a powerful concept in the world of software and artificial intelligence. Developed from the foundation of Markov Decision Processes (MDPs), RL allows software agents to make decisions and learn from their interactions with an external environment. Dr. Phil Winder, an expert in the field, has made significant contributions to deep reinforcement learning, shedding light on its practical applications in various domains.

The Core Elements of Reinforcement Learning

At the heart of RL lie two essential entities: the agent and the environment. The agent can be thought of as the decision-maker, whether it’s a human riding a bike or a software program defining the underlying model. The environment, on the other hand, encompasses everything external to the agent, influencing the outcomes of the agent’s actions. For instance, in the context of a person riding a bike, the environment comprises the bike itself, the surrounding world, wind, and other factors.

The agent’s primary objective is to generate actions, which are the decisions on what to do next. These actions can be simple or complex and have a direct impact on the environment. For instance, while riding a bike, actions could involve steering, pedaling, braking, or any combination thereof. The key is that these actions must affect the environment, enabling the agent to observe their effects.

However, direct access to the underlying state of the environment is often challenging or impossible. Instead, agents rely on observations to get a sense of the environment’s state. These observations may be partial representations, such as sensing the wind, feeling balance, or detecting user behavior in software applications.

Moreover, reinforcement learning heavily relies on the concept of rewards. After the agent performs an action and observes the environment’s response, it receives a reward that provides feedback on the quality of its action. A positive reward reinforces good behavior, while a negative reward discourages undesirable actions. Determining suitable rewards can be challenging, especially when dealing with complex problems or long-term consequences.
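The agent–environment–reward loop described above fits in a few lines of Python. This is a minimal sketch, not code from the workshop notebooks: the coin-flip environment and the random policy are invented here purely to make the loop concrete.

```python
import random

class CoinFlipEnv:
    """A toy environment: the agent guesses a coin flip.

    Reward is +1 for a correct guess, -1 otherwise. The observation is
    the last outcome, a partial view of the environment's state."""

    def step(self, action):
        outcome = random.choice(["heads", "tails"])
        reward = 1 if action == outcome else -1
        observation = outcome
        return observation, reward

def random_policy(observation):
    """A placeholder agent that ignores the observation and guesses at random."""
    return random.choice(["heads", "tails"])

env = CoinFlipEnv()
observation, total_reward = None, 0
for _ in range(100):
    action = random_policy(observation)     # the agent decides what to do next
    observation, reward = env.step(action)  # the action affects the environment
    total_reward += reward                  # the reward provides feedback
```

Every RL system, however sophisticated, reduces to this loop: act, observe, receive a reward, repeat.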

Characteristics of a Good Reinforcement Learning Problem

Not all problems are suitable for RL. A good RL problem possesses specific characteristics that align with the strengths of the learning approach:

Exploration and Exploitation: A good RL problem requires a balance of exploration and exploitation. The agent must explore different strategies to find an optimal solution while exploiting its current knowledge to make informed decisions.

Clear and Obvious Rewards: Problems with well-defined and easily discernible good or bad outcomes provide clearer reward signals, making it easier for the agent to learn.

Rapid Feedback: Prompt feedback on actions helps agents learn more efficiently. Long delays between actions and rewards can hinder the learning process.

Strategic Decisions Over Time: RL is particularly suitable for problems involving multi-step, sequential decisions. It excels in handling strategic planning across time.

Real-World Applications of Deep Reinforcement Learning

Deep reinforcement learning has found practical applications in various domains, showcasing its potential to solve complex problems:

  • Robotics: RL has been applied to control robots and automate tasks such as pick-and-place operations, furniture assembly, and even playing sports.
  • Automated Penetration Testing: RL agents can be trained to discover vulnerabilities in web applications, helping enhance security measures.
  • Industrial Process Automation: RL can optimize processes in industrial settings, such as paper manufacturing plants, by controlling valves and machinery efficiently.
  • Traffic Control: Reinforcement learning algorithms have been employed to control traffic lights and optimize traffic flow on road networks.
  • Autonomous Vehicles: RL is utilized to control self-driving vehicles, making strategic decisions on the road.

Part 2: Developing RL Projects

In this sub-section, we will dive into the basics of deep reinforcement learning and its application in solving Markov decision processes (MDPs). Dr. Phil Winder explains the key aspects of RL problems and how they are addressed using sequential decision-making processes.

What is an RL problem?

RL, or Reinforcement Learning, is a class of algorithms that aim to solve Markov decision processes optimally. At its core, RL involves making sequential decisions in an environment, where each action affects the environment, and the agent receives feedback based on those actions. The process continues in a loop, with the agent observing the environment and adjusting its decisions accordingly.

An MDP, or Markov Decision Process, defines the framework for RL problems. It represents a set of states, actions, and rewards, and each state transition depends only on the current state and action, exhibiting the Markov property.

The RL problem seeks to determine the optimal sequence of actions that lead to the best outcome in the given environment. RL problems are often sequential in nature, forming trajectories that can branch out into a tree-like structure as the agent explores more possibilities.

Strategic Decision Making and Stochasticity in RL

Strategic decision making is a key aspect of RL problems. While some RL problems may aim for a single perfect solution, many are more stochastic, with multiple optimal strategies depending on the situation.

Even in deterministic environments (where actions have fixed outcomes), there can be numerous strategies to achieve success. The RL agent continuously learns from its observations and determines which strategy is most optimal at any given time.

In stochastic environments, where random events can occur, RL becomes more complex. However, this stochasticity opens up even more possible strategies for the agent to explore and optimize.

Domains and Entities in RL

RL problems typically involve domains where entities (agents) perform actions based on observations. These entities could be physical (e.g., robots) or virtual (e.g., software agents). The key aspect of an agent is its ability to take actions and learn from them.

Real-life scenarios where people make strategic decisions are ideal candidates for RL problems. Additionally, industrial control systems, such as advanced process control in manufacturing, can benefit from RL to optimize and automate processes.

A bounded context in the environment defines the relevant information that needs to be observed and processed by the agent. Observations should reflect the changes caused by the agent’s actions, ensuring that actions have an observable effect on the environment.

Observations and State Representation

Observations are crucial in RL because the agent’s understanding of the environment is based on these inputs. It’s essential to represent observations in a way that conveys meaningful information about the state of the environment.

State or observation engineering is a critical step to reduce the dimensionality of data and improve the learning process. Various techniques, such as dimensionality reduction and feature selection, can be used to extract essential information while simplifying the input data.

Developing a suitable simulation or learning from historical data can help create meaningful observations. In some cases, human demonstrations or imitation learning can also be valuable in guiding the agent’s behavior.
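One common state-engineering technique is to discretize a continuous observation into a small number of buckets, shrinking an infinite state space down to something a tabular agent can enumerate. The helper below is a hypothetical sketch of that idea, not code from the workshop.

```python
def discretize(observation, low, high, bins):
    """Map a continuous value into one of `bins` integer buckets.

    A simple form of state engineering: values are clipped to [low, high]
    and assigned to equal-width buckets 0 .. bins-1."""
    clipped = min(max(observation, low), high)
    fraction = (clipped - low) / (high - low)
    return min(int(fraction * bins), bins - 1)

# e.g. a pole angle in radians mapped to 10 coarse buckets
bucket = discretize(0.05, low=-0.2, high=0.2, bins=10)
```

Choosing the range and bucket count is itself a design decision: too few buckets and meaningfully different states collapse together; too many and the agent rarely revisits any state.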

Policy Engineering

Once we move past the observation phase in DRL, the focus shifts to the policy, which serves multiple crucial functions. Dr. Winder emphasizes three primary roles of the policy, not extensively discussed in the literature:

  • Observation Conversion: The policy takes observations as input and converts them into a useful format that it can process. This step often involves feature engineering to transform raw data into a representation suitable for the model.
  • Policy Model Learning: The policy model itself needs to be learned and possibly remembered. Various ML algorithms can be employed here, with deep learning being a popular choice due to its flexibility.
  • Action Conversion: The output of the model needs to be transformed into actionable decisions. For instance, in continuous action spaces, the model might output statistical distributions (e.g., mean and standard deviation) that require sampling to obtain the final action.
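All three roles can be seen in a single small function. The sketch below is illustrative only: the scaling, the stand-in linear "model", and the Gaussian action head are assumptions made up for this example, not the policy from any workshop notebook.

```python
import random

def policy(raw_observation):
    # 1. Observation conversion: scale the raw input to a model-friendly range.
    x = raw_observation / 100.0

    # 2. Policy model: a stand-in linear model that outputs the parameters
    #    of a Gaussian distribution over a continuous action space.
    mean = 2.0 * x
    std = 0.1

    # 3. Action conversion: sample the distribution to get a concrete action.
    return random.gauss(mean, std)

action = policy(50.0)  # distribution centered on 1.0 with a small spread
```

In a real DRL agent the middle step is a neural network and the learning algorithm adjusts its weights, but the input-model-output pipeline keeps this shape.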

Challenges in Policy Engineering

One significant challenge in policy engineering is dealing with continuous observations. Since creating a single state for each continuous value is infeasible, some form of function approximation (like deep learning) becomes necessary. However, care must be taken to avoid overfitting, as bad generalization can severely impact the overall performance of the agent.

Moreover, defining appropriate rewards is crucial. Rewards should be simple and closely aligned with the desired business goals. Complicated rewards can lead to undesirable behaviors and make it challenging to understand agent performance.

The Importance of Generalization

Generalization plays a vital role in reinforcement learning, particularly when it comes to continuous states. If the agent fails to generalize well, it may follow a trajectory that leads to failure states and negatively affects the entire run. Ensuring good generalization is vital for successful DRL implementations.

Going Deep with Deep Learning

Deep learning is a valuable tool in DRL, especially when dealing with continuous observations and complex domains like video inputs. However, it’s essential to remember that deep learning is just one of many modeling techniques available. It is not a one-size-fits-all solution, and other RL frameworks can be equally effective in different scenarios.

For serious RL development and production projects, RLlib, built on top of Ray, is a popular choice. Ray is a distributed computational framework that simplifies managing experiments at scale. RLlib provides a robust set of reinforcement learning algorithms and interfaces well with Ray’s capabilities.

For more experimental purposes and algorithm tinkering, stable-baselines3 is an excellent option. It is a Python implementation of RL algorithms based on a project by OpenAI and offers easy-to-understand code and a variety of examples to get started.

Part 3: Practical Examples

The code for this section can be found on GitHub.

Example 1: Improving Language Model Sentiment

The first example involves using reinforcement learning to alter the output of a large language model. Dr. Winder uses a language model trained on Donald Trump’s tweets from the repository “text RL” by Eric Lamb. The goal is to make the model produce positive and nicer tweets.

The process begins with downloading the necessary tokenizer and model weights. Dr. Winder also employs a sentiment analysis model to determine whether the generated tweets are positive or negative. By applying reinforcement learning, the model receives rewards based on the sentiment analysis and fine-tunes its output to generate more positive content.

The RL environment is crucial in any RL problem. Dr. Winder showcases a text-based RL environment with pre-implemented functionality that users can explore further. The focus of this example is on understanding the reward function and RL algorithms.

The RL agent in this example uses the Proximal Policy Optimization (PPO) algorithm, which is robust and well-suited for online training. Dr. Winder goes through the training loop, where the agent interacts with the environment, receives rewards based on its behavior, and updates its policy to maximize rewards.

The results are intriguing as the model gradually starts producing more positive tweets. It showcases how reinforcement learning can be used to alter the behavior of language models, allowing for the generation of desired outputs.

Example 2: Monte-Carlo RL Algorithms in Grid-World Environments

In this section, we will explore the basics of Reinforcement Learning (RL) and focus on the Monte Carlo algorithm as a way to solve RL problems. RL is a framework that allows us to model and optimize sequential decision-making processes, often known as Markov Decision Processes (MDPs). The goal of RL is to enable an agent to learn strategies or techniques to perform better in a given environment.

RL involves two key entities: the agent and the environment.

  • The Agent: The agent is our RL model or algorithm, which learns to make decisions and take actions to achieve specific objectives.
  • The Environment: The environment is everything external to the agent, and it provides feedback in the form of rewards based on the agent’s actions.

One of the classic RL simulation environments is the Grid World, resembling a chessboard or checkerboard. The agent can move from one square to another in various ways, and the objective is to reach a terminal state while avoiding certain states (holes) that lead to failure.

Components of RL: State, Action, and Reward

  • State: The observation of the environment at a given time. In the Grid World, we can see the agent’s position.
  • Action: The agent’s choice of action at a specific state, which affects the environment.
  • Reward: The feedback provided by the environment to the agent, indicating the quality of the chosen action. In the Grid World, a constant negative reward is used to encourage reaching the terminal state in as few steps as possible.
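A Grid World matching this description can be written in a few dozen lines of plain Python. The layout below (a 4x4 grid, a goal in one corner, two holes, a constant -1 step reward) is an invented configuration for illustration, not the exact environment from the notebook.

```python
class GridWorld:
    """A 4x4 grid. Start at (0, 0), goal at (3, 3).

    Every step costs -1 to encourage short paths; stepping into a hole
    ends the episode with a larger penalty."""

    def __init__(self):
        self.goal = (3, 3)
        self.holes = {(1, 1), (2, 3)}
        self.reset()

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        dr, dc = moves[action]
        row = min(max(self.pos[0] + dr, 0), 3)   # walls clip movement
        col = min(max(self.pos[1] + dc, 0), 3)
        self.pos = (row, col)
        if self.pos == self.goal:
            return self.pos, -1, True            # terminal success state
        if self.pos in self.holes:
            return self.pos, -10, True           # terminal failure state
        return self.pos, -1, False               # constant step cost
```

Walking three squares down and then three right reaches the goal in six steps for a return of -6, which is the best an agent can do in this layout.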

The Monte Carlo algorithm is a basic RL approach for solving MDPs by randomly sampling trajectories. It involves the following steps:

  • Generate Trajectories: Repeatedly run episodes by randomly selecting actions and collecting state-action pairs until reaching a terminal state or a maximum step limit.
  • Calculate Average Action Values: Compute the average rewards obtained by performing specific actions in each state.
  • Derive Optimal Policy: Select the action with the highest average action value for each state as the optimal policy to follow.
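The three steps above can be sketched in plain Python. To keep the example self-contained it uses a tiny one-dimensional corridor as a stand-in for the Grid World (states 0 to 4, terminal at 4, -1 per step): generate random trajectories, average the return observed after the first visit to each state-action pair, then act greedily on those averages.

```python
import random
from collections import defaultdict

TERMINAL, ACTIONS = 4, ["left", "right"]

def step(state, action):
    """Move along the corridor; every step costs -1, state 4 is terminal."""
    state = min(state + 1, TERMINAL) if action == "right" else max(state - 1, 0)
    return state, -1, state == TERMINAL

def run_episode(max_steps=50):
    """Step 1: generate a trajectory of (state, action, reward) at random."""
    state, transitions = 0, []
    for _ in range(max_steps):
        action = random.choice(ACTIONS)
        next_state, reward, done = step(state, action)
        transitions.append((state, action, reward))
        state = next_state
        if done:
            break
    return transitions

random.seed(0)
returns = defaultdict(list)
for _ in range(2000):
    g, first_return = 0, {}
    # Accumulate the undiscounted return-to-go backwards through the episode.
    for state, action, reward in reversed(run_episode()):
        g += reward
        first_return[(state, action)] = g  # later overwrites keep the FIRST visit
    for sa, ret in first_return.items():
        returns[sa].append(ret)

# Step 2: average action values. Step 3: the greedy policy per state.
q = {sa: sum(v) / len(v) for sa, v in returns.items()}
policy = {s: max(ACTIONS, key=lambda a: q.get((s, a), float("-inf")))
          for s in range(TERMINAL)}
```

Because moving right from state 3 always terminates immediately, its average value is exactly -1, and the derived policy is to move right in every state, which is what visualizing the value table would show.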

Visualizing RL Solutions

Visualizing the RL process and results is crucial for understanding and debugging the algorithms. Using visualizations, we can observe the agent’s progress, average values, and derived policies, which can help identify and fix bugs in the simulation.

Challenges and Next Steps

  • Fixing Bugs: The first step in refining RL solutions is to identify and fix any bugs in the environment or code.
  • Adding Complexity: Experiment with different environment sizes, reward functions, and obstacles to observe how they affect the RL algorithm’s performance.
  • Exploring Advanced Algorithms: Move beyond the Monte Carlo algorithm to explore more efficient and advanced RL algorithms.
  • Realistic Environments: Consider developing RL environments tailored to your specific domain, step-by-step, to ensure proper functionality and problem-solving capabilities.

