Automating Cyber-Security with Reinforcement Learning

by Dr. Phil Winder, CEO

The best way to improve the security of any system is to detect all vulnerabilities and patch them. Unfortunately, this is rarely possible due to the extreme complexity of modern systems.

The common suggestion is to test for security, often leveraging the expertise of security-focussed engineers or automated scripts. But there are two fundamental issues with this approach: 1) security engineers do not scale, and 2) scripts are unlikely to cover all security concerns to begin with, let alone deal with new threats or increased attack surfaces.

Reinforcement learning (RL) is a subset of machine learning that learns optimal strategies through trial and error. An RL agent, the embodiment of an RL model, can be taught to use strategies, much as a human can, to achieve a goal.
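To make trial and error concrete, here is a minimal sketch of the idea: an epsilon-greedy agent discovers which of three actions pays off best. The reward probabilities are invented for demonstration.

```python
import random

# Hidden success probability of each action; the agent must discover
# these through trial and error.
REWARD_PROBS = [0.2, 0.5, 0.8]

def pull(action):
    """Return reward 1 with the action's hidden probability, else 0."""
    return 1 if random.random() < REWARD_PROBS[action] else 0

def train(steps=5000, epsilon=0.1, seed=0):
    random.seed(seed)
    counts = [0, 0, 0]
    values = [0.0, 0.0, 0.0]  # running average reward per action
    for _ in range(steps):
        if random.random() < epsilon:
            action = random.randrange(3)        # explore
        else:
            action = values.index(max(values))  # exploit the best so far
        reward = pull(action)
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values

values = train()
best = values.index(max(values))
```

After a few thousand trials the agent's value estimates converge towards the hidden probabilities, and it settles on the highest-paying action without ever being told the odds.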

So for example, we could train an RL agent to perform a penetration test or “hack” an API. Organizations can take advantage of this technology to proactively and reactively test that their applications are secure (or at least as secure as the capabilities of the agent).

In this article I present three key areas within cyber-security that are benefiting from RL, which my team and I are able to deliver as point-solutions.

My New Book on Reinforcement Learning

Do you want to use RL in real-life, business applications? Do you want to learn the nitty-gritty? The best practices?

We've written a book for O'Reilly on Reinforcement Learning. It focuses on industrial RL, with lots of real-life examples and in-depth analysis.

Web Application Firewalls

One important threat originates via payloads from the public internet, where the attacker mutates requests to discover and exploit vulnerabilities.

Web application firewalls (WAFs), in other words curated firewalls for a specific application, are used to detect and block suspicious behaviour. These are often rules-based, and when they detect nefarious activity they mitigate against potential damage.

However, a WAF's effectiveness depends entirely on its ability to detect whether payloads are harmful.

Defining “harmful” is a moving goalpost, because attackers are constantly trying to find new patterns that evade detection.

WAFs and the applications they protect are particularly vulnerable to attack because of the sheer complexity of highly expressive languages like SQL.

One solution to this problem is to build an autonomous agent that proactively attacks a WAF until it finds a way through. Such an agent can generate malicious payloads and learn the weaknesses of the current WAF configuration.

Open-source WAF implementations like ModSecurity provide an industry-standard proving ground that can be incorporated into a simulation, often referred to as a Gym, after OpenAI’s Gym interface.

With a convenient simulation, RL agents are able to automatically develop effective strategies to circumvent WAF policies and plausibly affect running systems. Agents could also target application-code with a similar approach.
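As a sketch of what such a simulation might look like, the toy Gym-style environment below rewards an agent for mutating an SQL injection payload until a simulated WAF rule no longer matches it. The single regex rule, the payload, and the mutation actions are all illustrative stand-ins, not a real ruleset such as ModSecurity's.

```python
import re

# A single, toy WAF rule: block anything that looks like "UNION SELECT".
BLOCK_RULE = re.compile(r"union\s+select", re.IGNORECASE)

class ToyWAFEnv:
    """Gym-style environment: actions mutate the payload; evading the
    rule while keeping the injection intact earns reward."""

    ACTIONS = ["swap_case", "split_keyword", "comment_space"]

    def reset(self):
        self.payload = "' UNION SELECT username, password FROM users--"
        return self.payload

    def step(self, action):
        if action == "swap_case":
            self.payload = self.payload.swapcase()
        elif action == "split_keyword":
            # Break up the keyword with an SQL inline comment.
            self.payload = self.payload.replace("UNION", "UN/**/ION")
        elif action == "comment_space":
            # Replace whitespace with inline comments.
            self.payload = self.payload.replace(" ", "/**/")
        blocked = bool(BLOCK_RULE.search(self.payload))
        reward = 0.0 if blocked else 1.0  # evading the rule earns reward
        done = not blocked
        return self.payload, reward, done, {}

env = ToyWAFEnv()
env.reset()
payload, reward, done, _ = env.step("comment_space")
```

In this toy setting, case-swapping fails (the rule is case-insensitive) while comment injection succeeds, which is exactly the kind of distinction an agent learns from reward alone. Against a real ruleset the action space would be far richer, but the environment interface stays the same.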

Operators and developers would then evaluate the results of the agent and implement changes to reduce or remove gaps in the deployed configuration and application.

Pre-Compromise and Penetration Testing

Modern IT infrastructure and applications are designed, implemented, and maintained by domain experts, not security experts. Securing these assets has become a full-time role due to the ever-increasing number and scale of external attacks. The fast pace of modern software development means that security engineers are in high demand and are often a bottleneck.

One solution is to attempt to automate security processes and testing. Penetration testing, in particular, is ripe for automation because the process of searching for vulnerabilities is labour and skill intensive.

Of course, targeting depends on specific application implementations, so most current research focuses on the learning of optimal attack patterns in an abstracted environment, rather than on industrial tools.

A more tractable solution might be to train an agent to schedule and direct security experts towards significant risk; in other words turn it into a prioritized scheduling problem.

But given time and engineering effort, applying automated penetration testing to specific problems is quite feasible.

For example, you could easily model the actions and subsequent movement through web content, which means you could also automate the attempted penetration of dynamic web content.

One example result could be an agent that learns that it is possible to visit some abandoned test page and then POST a request to wp-login/index.php, which results in admin access.
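The abstracted-environment approach above can be sketched as a tiny Markov decision process. The states, actions, and transitions below are invented for illustration; a tabular Q-learning agent discovers the two-step path from the forgotten test page to admin access purely from reward.

```python
import random

# Invented state machine: POSTing to the login endpoint only works after
# visiting the abandoned test page (imagine it leaks a token).
TRANSITIONS = {
    ("home", "browse_links"): "test_page",
    ("home", "post_login"): "home",        # direct POST is rejected
    ("test_page", "browse_links"): "home",
    ("test_page", "post_login"): "admin",  # the leaked token makes this work
}

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=1):
    random.seed(seed)
    actions = ["browse_links", "post_login"]
    q = {(s, a): 0.0 for s in ("home", "test_page") for a in actions}
    for _ in range(episodes):
        state = "home"
        for _ in range(10):  # cap episode length
            if random.random() < epsilon:
                action = random.choice(actions)          # explore
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state = TRANSITIONS[(state, action)]
            reward = 1.0 if next_state == "admin" else 0.0
            future = 0.0 if next_state == "admin" else max(
                q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (
                reward + gamma * future - q[(state, action)])
            if next_state == "admin":
                break
            state = next_state
    return q

q = q_learning()
```

The learned Q-values encode the strategy: from the home page, browsing links is worth more than POSTing blindly, and from the test page the POST is the highest-value action. Real targets would need far larger state spaces, which is precisely why current research works in abstractions like this one.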

Adversarial Attacks - Attacking the Attacker

Traditionally, adversarial training pits one system against another, for example to convince a discriminator that a machine learning algorithm performs some task, like painting or answering questions, as well as, if not better than, a human. But in the context of security, adversarial ideas are particularly attractive.

Imagine a software-defined network, like those used in most Kubernetes implementations. The networks that applications use aren’t directly coupled to physical switches and patch cables. They are entirely defined in software; in an emulation.

Software-defined networks provide the multi-pronged benefits of flexibility, dynamism, and encapsulation. But attackers with access could leverage the same mechanisms to, for example, redirect traffic to an external public server or eavesdrop on all communications.

An internal, friendly, RL-based agent is equally capable of using the same techniques to block and trap an attacker on the network in a virtual prison.
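The trapping mechanism itself can be sketched independently of the learning. The flow-table structure and honeypot address below are invented for illustration; a real software-defined network controller would expose equivalent primitives for rewriting flows.

```python
# The "virtual prison": once a host is flagged as compromised, rewrite the
# network's flow table so all of that host's traffic is silently redirected
# to an isolated honeypot, where the attacker can be observed safely.
HONEYPOT = "10.99.0.1"  # invented address for the isolated honeypot

def quarantine(flow_table, compromised_host):
    """Redirect every flow originating from the compromised host."""
    trapped = []
    for flow in flow_table:
        if flow["src"] == compromised_host:
            flow["dst"] = HONEYPOT  # the attacker still sees "a network"
            trapped.append(flow)
    return trapped

flows = [
    {"src": "10.0.0.5", "dst": "10.0.0.9"},  # compromised host's flow
    {"src": "10.0.0.7", "dst": "10.0.0.9"},  # legitimate traffic
]
trapped = quarantine(flows, "10.0.0.5")
```

An RL agent's role would be deciding when and whom to quarantine, trading off the risk of trapping a legitimate user against the risk of letting an attacker roam, while the software-defined network makes the action itself a simple table rewrite.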

This might sound like Tron, I know! But this is entirely feasible and has been recently demonstrated in a simplified environment.

The idea also applies to any other type of attack.

For example, I can imagine a WAF detecting an attack and dynamically creating an SQL honey-pot, or even a reverse bot-net that performs DoS attacks on the attacking bots.

However, this raises the intellectually intriguing but scary prospect of an RL-driven arms-race, with attacks and counter-attacks creating a virtual dystopia.

Act Now

In reality, I think the future is an integrated one: multiple, best-of-breed security tools interacting via open APIs. Agents could then easily block attacks at the firewall level or by altering WAF rules, thereby preventing attacks at the source.

And, if applicable, agents can feed information back to downstream applications and CI/CD pipelines and upstream open-source projects to ensure that vulnerabilities are checked and patched quickly.

Being at the forefront of RL-driven cyber-security is also a competitive advantage. The cost of automating the discovery of security holes and developing active countermeasures is dwarfed by the costs associated with a successful hack.

My colleagues and I at Winder.AI are here to help you find the best data-driven solution to your problem where RL may (or may not) be an integral part. Please contact us if you’d like to learn how RL can help your business, or indeed we’d love to talk about any data-oriented problem that you have.

If you’re interested in learning more about RL, you can find a wealth of information in our book.

