Testing and Evaluating Large Language Models in AI Applications

by Dr. Phil Winder, CEO

With the rapidly expanding use of large language models (LLMs) in products, ensuring performance and reliability is crucial. But with random outputs and non-deterministic behaviour, how do you know whether your application performs well, or works at all?

This article and its companion webinar offer a comprehensive, vendor-agnostic exploration of techniques and best practices for testing and evaluating LLMs, ensuring they meet your success criteria and perform effectively across varied scenarios.

  • Learn how to create test cases and how to structure prompts for evaluation
  • Discover a variety of evaluation techniques, from LLM-focussed metrics to extrinsic testing methodologies
  • Analyze evaluation results and ensure consistency and reliability across different inputs
  • Use a mental framework to help organize the different components

Target Audience

This webinar is perfect for those who are interested in, or are already using, LLMs within their systems. To get the most out of it, you should be familiar with language models. If you need a quick recap, one of our previous introductory articles will help you.


Challenges and Failure Points of LLM Applications

Last month’s article about RAG use cases concluded with a slide about how hard it is to test and evaluate LLMs. The key challenge is that LLM predictions are inherently noisy. The output is intentionally random and non-deterministic. Human language is also so complex that it’s very hard to define what is “correct”.

To put it succinctly, testing LLMs is hard.

This article investigates how to test and evaluate LLMs in AI applications.

Figure: RAG challenges.

What are the Differences Between Testing and Evaluating LLMs?

Testing and evaluation are often conflated. But they have different purposes and target audiences. The differences are summarized in the table below:

|             | Testing                        | Evaluation                               |
|-------------|--------------------------------|------------------------------------------|
| Purpose     | Ensuring the system works      | Measuring the performance of the system  |
| Audience    | Application Developers         | Model Developers                         |
| Focus       | Validity of a single use case  | General capability of the model          |
| Stability   | Locally robust                 | Stable on average                        |
| Variability | High, but across one domain    | Low, but across many domains             |

The purpose of testing is to ensure that the system works. It’s focused on the validity of a single use case. It’s locally robust in the sense that it’s good at proving that the system works for a specific problem. It achieves this by testing the narrow system in a variety of ways.

Evaluation, on the other hand, is focused on measuring the capabilities of a model. It’s more general and is aimed at model developers. It’s stable on average, because model developers use benchmarks to direct their development. You can’t be sure that the model will perform well on your specific use case, because the evaluation covers a small number of tasks spread across many domains. However, it’s reasonable to assume that if you improve the model’s performance across the benchmarks, you will also improve its performance on your use case.

That’s not to say that testing and evaluation are mutually exclusive. They are complementary. Testing often uses evaluation metrics to measure performance to ensure the system works as expected.

To summarize:

  • Evaluation – measuring the performance of a model against task-specific metrics, usually across many domains.
  • Testing – validating a task within a specific domain under varying conditions.

Now we can tackle each element in turn.

Evaluation

The goal of this section is to introduce you to the evaluation of LLMs, from the perspective of a developer building an application.

The first question you need to ask yourself is “what is the task?”

I’m not asking what the use case is here. In general, use cases don’t have evaluation metrics because they often comprise multiple tasks; a business chatbot, for example. This is not true when it comes to testing, because tests are supposed to ensure that the system succeeds in the use case.

If you have a background in machine learning, you might be familiar with the concept of evaluation metrics. These are used to measure the performance of a model. But LLMs are different. They are used for higher-level, knowledge-based tasks, like question answering, summarization, and generation. These tasks are more complicated and require more nuanced evaluation metrics.

But even though they are probably overkill, language models are still useful for traditional tasks, like classification. Some traditional evaluation metrics include accuracy, recall, precision, and the F-score.

You can consider these metrics the low-level building blocks of evaluation.
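As a minimal sketch (assuming scikit-learn and a toy sentiment-classification task with hand-labelled examples), these low-level metrics are simple to compute from the model’s predicted labels:

```python
# A sketch: scoring an LLM used as a sentiment classifier with traditional metrics.
# The labels and predictions below are illustrative.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["positive", "negative", "positive", "neutral", "negative"]
y_pred = ["positive", "negative", "neutral", "neutral", "positive"]  # LLM outputs

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```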

The Problem With Traditional Metrics

But LLMs are mostly used for higher-level, knowledge-based tasks. These tasks require evaluation metrics that capture the quality of the output as much as its correctness. If you dig into these metrics, you’ll find that they are built on the low-level ones: relevance is related to recall and precision, correctness is related to accuracy, and so on. Some examples by task:

  • Question Answering: relevance, hallucination, correctness, etc.
  • Summarization: alignment, coverage, relevance, etc.
  • Generation: functional correctness, syntactic correctness, etc.
  • and so on.

But these are still highly task specific. For example, consider the difference between generating JSON and SQL. The evaluation procedure for these two tasks would be very different. So while you could create metrics for each of these tasks, doing so for the sake of it isn’t scalable.
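To make that contrast concrete, here is a hedged sketch of syntactic-correctness checks for generated JSON versus generated SQL. The json module is in the standard library; sqlglot is one third-party parser you could use for the SQL side:

```python
# A sketch of task-specific syntax checks for generated output.
import json

import sqlglot
from sqlglot.errors import ParseError


def json_is_valid(text: str) -> bool:
    """Does the generated output parse as JSON?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def sql_is_valid(text: str, dialect: str = "postgres") -> bool:
    """Does the generated output parse as SQL in the target dialect?"""
    try:
        sqlglot.parse_one(text, read=dialect)
        return True
    except ParseError:
        return False


print(json_is_valid('{"status": "ok"}'))     # expect True
print(sql_is_valid("SELECT id FROM users"))  # expect True
```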

The Need For Aggregations

Since there are so many ways that you could evaluate a response, aggregations of metrics have emerged to capture specific human-like traits or specific use cases.

For example:

  • BLEU: evaluates “quality” compared to a reference text, built from an average of n-gram precisions
  • ROUGE: an averaged, recall-like metric for summarization
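As a rough sketch (assuming the nltk and rouge-score packages, with an illustrative reference and candidate), both scores can be computed in a few lines:

```python
# A sketch of computing BLEU and ROUGE for a single candidate/reference pair.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"

# BLEU: an aggregate of n-gram precisions against the reference.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented overlap, commonly used for summarization.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.2f}")
print(f"ROUGE-L recall: {rouge['rougeL'].recall:.2f}")
```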

And then others combine these aggregations into even higher-level abstractions!

And so on.

You end up with a tree of metrics that you could use. But they’re all rooted in the task.

Practical Approach

Ideally, you want to browse a list of all possible metrics. Then you rank and pick metrics from that list that correlate with your task. You could then use that to benchmark several different models over those metrics.

If you can’t, you’ll need to create your own metrics. For example, for one of our recent clients the golden metric was “percent of predictions that were correct to within +/- 3 minutes”, since in that domain being within three minutes was equivalent to perfection.

This changed the focus of the development of the model. We then concentrated more on outliers than we did on average performance.
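As a sketch (with illustrative numbers), a bespoke metric like that is only a few lines of code:

```python
# A sketch of a bespoke "golden" metric: the percentage of predictions that
# land within +/- 3 minutes of the true value. The data is illustrative.
def within_tolerance_pct(actual_minutes, predicted_minutes, tolerance=3.0):
    hits = sum(
        abs(a - p) <= tolerance for a, p in zip(actual_minutes, predicted_minutes)
    )
    return 100.0 * hits / len(actual_minutes)


actual = [12.0, 45.0, 30.0, 7.0]
predicted = [14.0, 49.5, 31.0, 6.0]
print(f"{within_tolerance_pct(actual, predicted):.0f}% within 3 minutes")  # 75%
```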

Available Tools

If that doesn’t sound like fun, or if you’re short on time, then it might be easier to use off-the-shelf metrics. There are a variety of tools that provide industry-standard metrics that you can plug in and use as a proxy.

This is a reasonable approximation because if you improve those metrics, it’s quite likely that you will also improve your business metric.

There are several off-the-shelf evaluation frameworks that can help you on your journey.

What Are LLM Benchmarks?

Benchmarks are combinations of metrics and test data. Typically they include a variety of high-abstraction metrics like “truthfulness” and “task understanding” and a large number of human-curated answers.

Benchmarks are typically aimed at people who are developing, training, or fine-tuning models. More and more people are fine-tuning models, and benchmarks are a great way to ensure you’re not destroying the general capabilities of your base model.

LLM Leaderboards

Benchmarks are most commonly used to power LLM leaderboards. These help you decide which base model is better.

But benchmarks are publicly available, which means that they are available to the model developers too. All of the models at the top of the leaderboards are likely to have been trained upon the benchmarks that are trying to evaluate them.

The benchmarks attempt to fight back with obfuscations and randomisation, but the general idea is that public evaluations gamify model development.

Hugging Face has a leaderboard that presents many metrics for open source language models.

The Need For Human Evaluations

The problem of overfitting the benchmarks has led to a new era of human-powered performance benchmarks, which pit one model against another in a battle. A human (we assume!) then decides which model has given the best response. The results are aggregated using an Elo rating system to produce a final “skill level” metric.
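If you’re curious what that “skill level” looks like mechanically, here is a minimal sketch of a single Elo update; the K-factor and starting ratings are illustrative, and real leaderboards use more elaborate variants (such as Bradley-Terry fits):

```python
# A minimal sketch of one Elo update after a head-to-head model battle.
# K-factor and ratings are illustrative, not the arena's actual settings.
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Model A (1250) beats model B (1280): A gains rating, B loses the same amount.
print(elo_update(1250.0, 1280.0, a_won=True))
```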

LMSYS Chatbot Arena is one of the most famous implementations of this. You can see the current LLM leaderboard here. As of writing, GPT-4o just pips Claude 3.5 and Gemini-1.5 with a score of 1287.

For reference, Llama-3-8b-Instruct is in 31st place with a score of 1152, just behind GPT-4-0613. This is the general-purpose open source model I most commonly use in my work.

The problem with this metric is that it’s too coarse and not expressive enough to distinguish capability. It provides a ranking; it does not measure any objective notion of performance. For example, what does it mean for Llama-3-8b to be “10” behind GPT-4? Is that good? Or is that a long way off?

Evaluation Conclusion

Evaluation is rooted in the metric you choose for your particular task. But use cases are often more complicated than simply achieving the best average performance on a particular task. You may have non-functional requirements regarding resource usage or latency. And the use case may have constraints that you must enforce.

Evaluation techniques are only appropriate for measuring aggregate performance. When developing an application you should be constraining the LLM to ensure it does what your use case requires it to do, and no more. Robustness is the key driver here and is not only task specific, but use case specific to boot.

Testing

To recap, testing aims to ensure that a system works as intended. It’s much more bespoke to your use case and domain, and it is unlikely that there are publicly available test suites for you to use. When it comes to language models, testing is hard because we want the output to be diverse and capable. Testing such output is not as simple as checking for equality.

Prompt Driven Development

To test a traditional software application you would write scenarios that must succeed, otherwise it is considered broken.

This is a simple rubric that sadly does not work with AI applications because the underlying data is inherently noisy. Instead, you must test a range of scenarios to increase your confidence that the solution is most likely working.

Engineers and managers alike often squirm at this definition because it leads to the obvious conclusion that your application is broken at least some of the time. Trying to communicate that to your users is difficult. An uncertain world produces uncertain results.

Still, you can develop your AI application in such a way to give you a high degree of confidence that it’s working well enough. And that often leads to usefulness and therefore value. So all is not lost.

If we constrain our analysis to just language models for a moment, what does this involve?

Prompt Engineering and Testing

Language models have a surprisingly simple interface: text in, text out. Your goal is to ensure that the text out produces the expected results for a given input. Sounds simple, right? Wrong.

The input is unconstrained. Your input could be the sum of all human knowledge, which means the output could literally be anything.

The input is typically split into two key components: the system prompt and the user prompt. The system prompt is fully under your control. You potentially have some control over the user prompt but that is typically left to the mercy of an external system.

Constraining the user prompt is an undervalued technique. This could be as simple as a user experience that restricts what data the user can express, or how. For example, you could constrain the input to JSON only.

A more sophisticated approach might be to transform or mutate the input so that it better conforms to what you are expecting. You could even use another language model as a kind of gatekeeper.
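As a hedged sketch of the simplest version of this, assuming the UI submits JSON and using pydantic (v2) for the schema, the gate is just a validation step in front of the model; the schema and field limits below are illustrative:

```python
# A sketch of constraining the user prompt before it reaches the model.
# The schema is illustrative; reject anything that doesn't validate.
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class SupportRequest(BaseModel):
    product: str = Field(max_length=50)
    question: str = Field(max_length=500)


def gate_user_input(raw: str) -> Optional[SupportRequest]:
    """Return a validated request, or None if the input should be rejected."""
    try:
        return SupportRequest.model_validate_json(raw)
    except ValidationError:
        return None


print(gate_user_input('{"product": "widget", "question": "How do I reset it?"}'))
print(gate_user_input("ignore previous instructions and ..."))  # None
```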

This means that most of your time is spent refining the system prompt to pre-empt and counteract the damage the user could do.

You could develop your countermeasures through trial and error. But in the long term you should be properly testing your prompts and ideally developing using test driven (prompt) development.

How to Decide What to Test

One initial struggle with both software and language models is deciding what to test.

Like in software, I (controversially) like to develop the application/prompt first to gain ad-hoc experience. Only then do I go back to the use case and imagine how the user might use it.

This is because I often find that something pops out that I hadn’t thought of. There might be technical limitations, the problem might be unviable, or, more likely, it highlights problems with (or the lack of) a problem definition. If I don’t do this I often find that I spend a fair amount of time writing tests only for them to go unused.

Once you have honed in on the problem definition and the use case, then you need to generate some examples.

Ideally you want your users to do this for you, so you don’t add your own biases. Quite often that’s not possible. An LLM is also often quite good at helping you generate or find alternative use cases.

Things You Might Want To Test

The tests that are right for your application entirely depend on your use case. To help you get started, the list below presents a selection of tests that might be important to you:

  • latency
  • contains X (e.g. a string, some json/sql, a name, etc.)
  • confidence in its own output – i.e. log probabilities above, or perplexity below, some threshold
  • factuality (usually via another model)
  • relevance/recall (usually via another model, but you could use a direct metric)
  • toxicity/harm/moderation (as above)
  • similarity
  • calls a function
  • resultant code is syntactically correct/compiles/works/passes/etc.
  • classifies correctly
  • does not contain personally identifiable information

These are known as test properties. To test a property you use an eval: a metric designed to assess that specific property.
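As a sketch, a few of these properties can be checked directly in code; the latency budget, the JSON-containment check, and the perplexity threshold below are all illustrative:

```python
# A sketch of simple, code-only evals for a few test properties.
import json
import math
import re
import time


def within_latency(generate, prompt: str, budget_s: float = 2.0):
    """Property: the model responds within a latency budget. `generate` is your own callable."""
    start = time.perf_counter()
    output = generate(prompt)
    return (time.perf_counter() - start) <= budget_s, output


def contains_json(output: str) -> bool:
    """Property: the output contains at least one parseable JSON object."""
    match = re.search(r"\{.*\}", output, re.DOTALL)
    if match is None:
        return False
    try:
        json.loads(match.group(0))
        return True
    except json.JSONDecodeError:
        return False


def perplexity_below(token_logprobs: list[float], threshold: float = 10.0) -> bool:
    """Property: the model is confident in its own output (requires logprobs from your API)."""
    avg_negative_log_likelihood = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_negative_log_likelihood) < threshold
```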

Note that all of this is very closely related to monitoring. Everything we are talking about here could be applied to monitoring too. You should look at consolidating both approaches so you don’t repeat yourself too much. See related topics for more discussion.

LLMs Are Better At Classification Than They Are At Generation

Language models are better at classification than generation. The reason for this, in a nutshell, is that the task is much narrower in scope than generation, and narrow problems are easier to solve. Thinking of this in reverse, generating an example of a class requires a large amount of contextual information and planning. So unless you have a highly detailed and prescriptive prompt, it’s difficult for a language model to generate perfect examples.

I haven’t found a good citation for this, but I have seen this mentioned.

What this means is that even though a language model might not be great at doing the job you really want it to do, it probably is good enough at judging the output. There are exceptions to this of course. Just make sure you consider how hard the task is. The easier it is for you to do, the easier it will be for the language model.

In summary, you can use a language model to evaluate the result of a test.
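Here is a hedged sketch of that idea using the OpenAI Python client as one possible backend; any chat-completion API would do, and the model name and grading rubric are placeholders:

```python
# A sketch of a language model as judge: it classifies whether a generated
# answer satisfies a property, rather than generating the answer itself.
# Assumes the openai package and an OPENAI_API_KEY; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an answer for relevance to a question.
Question: {question}
Answer: {answer}
Reply with exactly one word: PASS or FAIL."""


def judge_relevance(question: str, answer: str, model: str = "gpt-4o-mini") -> bool:
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(question=question, answer=answer),
            }
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```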

How to Test Prompts

Now we have all the tools to build a suite of tests to help develop and test your prompt. To do this:

  1. Convert the business problem definition into distinct use cases
  2. For each use case, split it into tasks. Quite often there is only one task per use case, but more complicated use cases may require multiple tasks.
  3. For each task, define the test properties that fulfil the requirements of the task.
  4. For each property, create a test input to use in a test case. Use a language model to generate examples.

Organize these test cases in a hierarchy to help you manage them.
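As a sketch of what that hierarchy might look like in code (the names and structure are illustrative, not a prescribed format):

```python
# A sketch of the use case -> task -> property -> test case hierarchy.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TestCase:
    input_text: str
    eval_fn: Callable[[str], bool]  # the eval that checks the property for this input
    description: str = ""


@dataclass
class TestProperty:
    name: str                       # e.g. "output contains valid JSON"
    cases: list[TestCase] = field(default_factory=list)


@dataclass
class Task:
    name: str                       # e.g. "extract order details"
    properties: list[TestProperty] = field(default_factory=list)


@dataclass
class UseCase:
    name: str                       # e.g. "customer support chatbot"
    tasks: list[Task] = field(default_factory=list)
```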

Now that you have a list of test cases to evaluate, decide on the best mechanism to run them. Ideally, tests should be as fast as possible, because they directly affect development velocity.

  • Using code to do the evaluation is the fastest way of achieving this. It is perfect for properties that don’t require any interpretation, like latency. But many of the more interesting properties require a more comprehensive assessment.
  • Human evaluations are the gold standard, but don’t scale well. You might want to consider doing a few as a smoke test.
  • The slowest, but most flexible automated approach is to use another language model to programmatically evaluate whether a test case passes.

Anthropic have a nice notebook example of these three methods.

Double Down on Failures

For every failure, try to improve the system as a whole to make that test pass. This might involve improving the system prompt or it could be a matter of mutating the input. Either way, the idea here is to slowly improve your solution to fulfil your requirements.

If you do have persistent failures, it’s worth digging in to find more examples that fail in slightly different ways. For each failure, ask a language model to generate another example that your system will fail on.

You can also prompt the language model to tell you why it’s failing. Use that information to make your system even more robust.
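A hedged sketch of that loop is below. The three callables are hypothetical placeholders for whatever wraps your LLM client and your test runner, not a specific API:

```python
# A sketch of mining extra failures: for each failing case, ask a model for a
# slightly different input that also breaks the system, and ask it why.
# generate_variant, run_case and explain_failure are hypothetical callables
# you would wire up to your own LLM client and test harness.
def double_down(failing_cases, generate_variant, run_case, explain_failure, max_rounds=3):
    discovered = []
    for case in failing_cases:
        for _ in range(max_rounds):
            variant = generate_variant(case)
            passed, output = run_case(variant)
            if passed:
                break  # the system now handles this direction; move on
            discovered.append((variant, explain_failure(variant, output)))
            case = variant  # keep pushing in the failing direction
    return discovered
```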

Tools To Help You Test

Ultimately you will want to test a range of these properties in a systematic way. There are a few tools out there that help you do that, but they’re all essentially domain-specific libraries or languages that allow you to specify an input and the expected output.

The summary in the previous section was inspired by Microsoft’s adaptive testing app (AdaTest). This is a tool, backed by OpenAI models, with a frontend to help you manage and develop tests.

Promptfoo is a very similar tool and equally as useful.

If you’re building applications, not just prompts, you probably also want to write your own scripts to run your tests in an automated fashion, for CI/CD purposes. Keep your script and tests close to your codebase and easy (and fast!) to run.

You’ll obviously need access to a large language model and a playground to do prompt development. You can take advantage of a hosted provider or build a solution yourself with the likes of LangChain or LlamaIndex, something more end-user focused like Chainlit, or something more MLOps-y like Arize.

Related to the playground idea is the use of spreadsheets to organize prompt experiments via API connections to commercial model providers; for example, the Claude or GPT add-ons for Google Sheets.

If you’re part of a larger team, then you might want to start unifying or organizing your prompts centrally with something like PromptLayer or W&B Prompts.

Conclusion

Evaluation is most useful for those that are developing, training, or fine-tuning models. You can use benchmarks to ensure that, on average, your model performs well, or at least doesn’t perform much worse.

But if you are building a production application on top of a language model, then you need to have confidence that it works. You can do this by developing tests that take a wide range of user inputs and evaluating the output.

Continuous Integration

The logical extreme of both of these is a CI pipeline. The first stage in the pipeline evaluates the actual model you use in production to ensure that performance doesn’t change over time due to model updates or benchmark changes. The second stage tests your use case against that model to ensure that the system as a whole (including system prompts, input guards, etc.) works as expected.

This process is not cheap, in terms of both development and execution time, but it offers unparalleled confidence in the production quality of your application.

I decided not to include monitoring, to reduce the scope of the article. But there are obvious parallels between evaluating and testing offline whilst building an application and doing something similar online when the application is in production.

I’ll leave this for another day but here’s one link that begins to discuss the nuances when attempting to evaluate online.

Further Reading


LLMs: RAG vs. Fine-Tuning

When should you use retrieval augmented generation (RAG)? When should you fine-tune? Find out when and why and how to incorporate knowledge into LLMs.


Scaling StableAudio.com Generative Models Globally with NVIDIA Triton & Sagemaker

Learn from the trials and tribulations of scaling audio diffusion models with NVIDIA's Triton Inference Server and AWS Sagemaker.

}