AI Confidence: Evaluating AI Models with Synthetic Data

by Dr. Phil Winder, CEO

Building AI-powered applications is exciting, until you need to evaluate whether your model actually works. How do you test an AI system that gives different answers every time? How do you gather enough test data to be confident in your results? This one-hour practical webinar will transform you from uncertain to confident about AI evaluation.


Overview

I think the hardest part of building AI applications is testing and evaluation. When so much of your application depends on the underlying responses of an LLM, how can you be sure that it’s working?

And it’s far too easy to hack away, tweaking a prompt here or optimizing an LLM call there, thinking you’re improving your application. But how do you prove you’re helping, not hindering?

This content is based upon a day-long workshop I delivered at GOTO Copenhagen 2025. It covers the fundamental shift in how we approach testing when working with non-deterministic AI systems.

The Core Problem: Non-Determinism in AI

Traditional software engineering relies on deterministic tests with clear inputs, outputs, and assertions. Some approximation of test-driven development (TDD) has been the standard approach for decades.

But AI-driven systems fundamentally change this paradigm. LLMs are fuzzy black boxes. When you pass something in, you can’t be certain what you’ll get out.

A simple demonstration: running the same prompt (“Explain why Paris is an important city in 10-15 words”) ten times through a local Ollama instance with the default settings produces subtly different results each time.
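
A minimal sketch of that demonstration, assuming the ollama Python package, a running local Ollama server, and a pulled qwen3:1.7b model (any local model will do):

import ollama

PROMPT = "Explain why Paris is an important city in 10-15 words"

# Run the same prompt ten times with default settings; expect subtly different answers.
for i in range(10):
    response = ollama.generate(model="qwen3:1.7b", prompt=PROMPT)
    print(f"Run {i + 1}: {response['response'].strip()}")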

Controlling Output Variability

There are parameters under your control that increase the level of determinism; a sketch showing how to set them follows the parameter list below.

Key Parameters for Controlling Output

Temperature:

  • Controls randomness in model output
  • Lower values (0.1-0.3) = more deterministic, consistent responses
  • Higher values (1.0+) = more creative, varied responses
  • Think of it as flattening or sharpening the probability distribution of next-token selection

Top P (Nucleus Sampling):

  • Limits the model to the most probable tokens whose cumulative probability mass reaches P
  • Low values (0.1) = very consistent, only most probable tokens
  • High values (0.95+) = allows more diversity, can lead to hallucination or language switching
  • Warning: High top_p with multilingual models (like Qwen) can cause unexpected language switches

Top K:

  • Selects only the top K most probable tokens
  • More predictable than top_p since it’s a fixed count
  • Top K of 2 means only the two most likely tokens can be chosen

Repeat Penalty:

  • Penalises tokens that have already appeared, controlling how much the output repeats itself
  • Values around 0.75 allow repetition
  • Values around 1.5 discourage repetition
  • Higher values produce more concise outputs

Num Predict / Max Tokens:

  • Limits response length
  • Particularly useful for classification tasks where you expect short outputs (e.g., “true” or “false”)
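
A sketch of how these options might be passed to a local model through the Ollama Python client; the keys are Ollama’s option names, and other providers expose equivalents under slightly different names:

import ollama

# Options tuned for a factual/classification-style task (Ollama option names).
options = {
    "temperature": 0.2,     # low temperature for consistent answers
    "top_p": 0.4,           # only sample from the most probable mass
    "top_k": 20,            # consider at most the 20 most likely tokens
    "repeat_penalty": 1.1,  # mildly discourage repetition
    "num_predict": 5,       # cap output length, e.g. for "true"/"false" answers
}

response = ollama.generate(
    model="qwen3:1.7b",
    prompt="Is Paris the capital of France? Answer true or false.",
    options=options,
)
print(response["response"].strip())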

Setting the Random Seed Is Not Enough

You can set a random number generator seed in most APIs to make outputs reproducible on the same system. However, seeds alone aren’t sufficient because:

  • Different hardware architectures produce different results
  • CPU vs GPU vs TPU inference runtimes vary
  • Floating-point precision differences in chipsets cause cascading effects on model parameters
  • Switching model providers (e.g. OpenAI vs. Azure OpenAI) exposes implementation differences
  • Running in Docker vs. locally counts as an architecture change and affects results
  • Switching frameworks (same model!) produces different results

Bottom line: Even with temperature=0, top_k fixed, and seed set, you cannot guarantee identical outputs across different environments.
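
For example, fixing the seed alongside a zero temperature makes repeated runs match on one machine, but the caveats above still apply the moment the environment changes (a sketch using the Ollama Python client):

import ollama

# A fixed seed plus temperature 0 gives reproducible runs on the same machine only.
options = {"temperature": 0, "seed": 42}

outputs = {
    ollama.generate(
        model="qwen3:1.7b",
        prompt="Explain why Paris is an important city in 10-15 words",
        options=options,
    )["response"]
    for _ in range(5)
}
# Expect one distinct output here; a different machine, runtime, or framework may not agree.
print(f"{len(outputs)} distinct output(s)")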

Guidelines by Task Type

Factual/Analytical Tasks (classification, Q&A, consistent results needed):

  • Temperature: 0.1-0.3
  • Top P/K: 0.3-0.5
  • Focus on consistency

Creative Tasks (text generation, content creation):

  • Temperature: near 1.0
  • Top P: higher values, but be careful not to introduce hallucination
  • Allow flexibility, but monitor for quality degradation

Balanced Tasks (coding assistance, some creativity with data-backed approach):

  • Temperature: 0.4-0.5
  • Top P: around 0.5
  • Balance creativity and reliability
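
One way to apply these guidelines is a small, hypothetical table of presets that your application selects from per task; the numbers simply mirror the list above and are a starting point, not a rule:

# Hypothetical sampling presets mirroring the guidelines above (Ollama option names).
SAMPLING_PRESETS = {
    "factual": {"temperature": 0.2, "top_p": 0.4},
    "creative": {"temperature": 1.0, "top_p": 0.9},
    "balanced": {"temperature": 0.45, "top_p": 0.5},
}

def options_for(task_type: str) -> dict:
    """Return sampling options for a task type, defaulting to the balanced preset."""
    return SAMPLING_PRESETS.get(task_type, SAMPLING_PRESETS["balanced"])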

Context Length Considerations

  • Models have maximum context length limits for inputs and outputs
  • Self-hosted models: context length dramatically affects memory usage
  • Recommendation: Minimise prompts and expected outputs where possible
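
For self-hosted models served through Ollama, the context window is itself a per-request option (num_ctx); keeping it no larger than the task needs saves memory. For example:

import ollama

# num_ctx sets the context window in tokens for this request; larger windows use
# more memory on self-hosted hardware, so keep it as small as the task allows.
response = ollama.generate(
    model="qwen3:1.7b",
    prompt="Classify this email as urgent or not urgent: <email text here>",
    options={"num_ctx": 2048, "num_predict": 5},
)
print(response["response"].strip())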

Managing Hallucination

Hallucination isn’t purely negative. It’s the other side of the creativity coin. We can’t simultaneously demand “don’t hallucinate” and “be creative.” Evaluation is about finding the right balance.

Models hallucinate when asked to do things they have no knowledge of or lack context for. Example: Asking about the fictional “Dr Samantha Rodriguez, Nobel Prize winner in AI for 2025” can produce confident but fabricated biographical details.

Techniques to Reduce Hallucination

1. Instruct the model to acknowledge uncertainty: Add simple instructions like “If you don’t know, say so.” This is a no-brainer addition to virtually every prompt.

2. Ask the model to verify its own statements: Have the model apply verification before presenting answers. This guides output prediction toward correct answers.

3. Leverage thinking models: Reasoning models implicitly think before responding, which improves accuracy. The output tokens act as a guide through a maze toward the correct answer.

4. Include citations:

  • Especially valuable with RAG systems or external information retrieval
  • Ask the model to write out citations/verified facts first, then generate the answer
  • Even labelling things as “fact” helps ground responses

5. Chain of thought prompting: An extension of the verification approach — have the model reason step-by-step.

6. Appropriate temperature settings: Lower temperatures for factual tasks where consistency matters.
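
Several of these techniques can be combined in a single prompt. A sketch, with illustrative wording rather than a canonical template:

import ollama

# An illustrative system prompt combining uncertainty acknowledgement,
# citation-first grounding, and self-verification.
SYSTEM_PROMPT = """You are a careful assistant.
- If you do not know the answer, say "I don't know" rather than guessing.
- Before answering, list the facts you are relying on, each labelled FACT,
  with a citation if one was provided in the context.
- Verify each fact against the question before writing your answer.
- Keep the final answer short and grounded only in the listed facts."""

response = ollama.chat(
    model="qwen3:1.7b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Who won the Nobel Prize in AI in 2025?"},
    ],
    options={"temperature": 0.2},
)
print(response["message"]["content"])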

Evaluation Methods

Benchmarks

Benchmarks are curated datasets with scripts to quantify LLM performance. They typically:

  • Contain question-answer pairs with expected outputs
  • Use another LLM as a judge to determine correctness (see the sketch below)
  • Measure results quantitatively
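
A minimal sketch of the model-as-judge idea from the list above, assuming the Ollama Python client and whichever local model you trust as a judge:

import ollama

def judge(question: str, expected: str, actual: str) -> bool:
    """Ask a judge model whether the actual answer matches the expected answer."""
    prompt = (
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Model answer: {actual}\n"
        "Does the model answer convey the same meaning as the expected answer? "
        "Reply with exactly YES or NO."
    )
    reply = ollama.generate(
        model="qwen3:1.7b",
        prompt=prompt,
        options={"temperature": 0},
    )["response"]
    # Crude parsing of the verdict; a production judge would ask for structured output.
    return "YES" in reply.upper()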

Human evaluation is another form: websites exist where people compare qualitative results from various models.

Evaluation Frameworks: PromptFoo

PromptFoo is a powerful Node-based evaluation framework that provides:

  • A YAML configuration format for defining tests (CI-friendly)
  • A web UI for viewing results
  • Support for multiple providers, prompts, and test cases
  • Both deterministic and LLM-powered assertions

GitHub: https://github.com/promptfoo/promptfoo

Setting Up PromptFoo

Install globally via npm:

npm install -g promptfoo

YAML Configuration Structure

providers:
  - ollama:qwen3:1.7b  # List of models to test

prompts:
  - "Your prompt with {{variable}} placeholders"

tests:
  - vars:
      variable: "test input value"
    assert:
      - type: contains
        value: "expected_word"

Assertion Types

Deterministic assertions:

  • equals — exact match
  • contains — substring match
  • contains-json — valid JSON check
  • Custom JavaScript functions for complex logic

LLM-powered assertions:

  • Factuality checks
  • Semantic similarity
  • Custom model-as-judge evaluations

Running and Viewing Results

promptfoo eval              # Run evaluations
promptfoo view              # Launch web UI to view results

The UI shows:

  • Pass/fail status for each test
  • Variable inputs and model outputs
  • Comparison across different prompts/providers
  • Historical run comparisons
  • Filters for passes/failures

Synthetic Data Generation

Why Generate Synthetic Data?

Problem 1: Test coverage burden. Applications with diverse use cases require many test cases. Manual creation is time-consuming.

Problem 2: PII concerns. Real data often contains personally identifiable information that shouldn’t be in tests.

Approaches to Generating Synthetic Data

Simple approach: Use your favourite chat LLM provider to generate test data manually. Copy it into the right format for PromptFoo.

Script-based approach: Write a script that generates data based on predefined categories/seeds.

Example workflow for email classification (a sketch of the generation script follows the list):

  1. Define classification categories (urgent, supply, schedule, customer, maintenance, other)
  2. Create seed descriptions for each category
  3. Have an LLM generate realistic emails for each category
  4. Output in PromptFoo-compatible JSON format
  5. Assert that classifications match expected categories
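
A sketch of such a generation script, using the Ollama Python client to draft the emails and writing a JSON test file; the vars/assert structure mirrors the PromptFoo YAML example above, but check PromptFoo’s documentation for the exact external-test format it expects:

import json
import ollama

# Seed descriptions for each classification category.
CATEGORIES = {
    "urgent": "a production outage that needs immediate attention",
    "supply": "a question about stock levels or deliveries",
    "schedule": "a request to arrange or move a meeting",
    "customer": "a customer complaint or piece of feedback",
    "maintenance": "a notification about planned maintenance work",
    "other": "anything that fits none of the categories above",
}

tests = []
for category, description in CATEGORIES.items():
    for _ in range(3):  # a few synthetic emails per category
        email = ollama.generate(
            model="qwen3:1.7b",
            prompt=f"Write a short, realistic workplace email about {description}. "
                   "Return only the email body.",
            options={"temperature": 0.9},  # higher temperature for varied examples
        )["response"].strip()
        tests.append({
            "vars": {"email": email},
            "assert": [{"type": "contains", "value": category}],
        })

with open("synthetic_emails.json", "w") as f:
    json.dump(tests, f, indent=2)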

Using Synthetic Data in Evaluation

Configure PromptFoo to read test cases from a JSON file:

tests: synthetic_emails.json

This allows rapid iteration during prompt development, revealing which categories perform well or poorly.

Example finding: Running classification tests might reveal that “maintenance” and “other” categories perform poorly while “urgent,” “supply,” and “customer” work well. This directs prompt refinement efforts and might even suggest revisiting the classification taxonomy itself.

Practical Recommendations

Start Small: Gain confidence with smaller datasets and simpler “happy-path” end-to-end examples. Wide coverage is better than edge cases at this point.

Build Up Gradually: To begin with, review and write examples by hand if you can; this builds your domain expertise. Then start adding edge cases and refining your prompts.

When to Add Real Data: Once you’re happy with your manual prompts, incorporate real data if you have it. Avoid adding too many tests, which only adds to the maintenance burden.

When to Add Synthetic Data: Once you’re happy with your manual prompts, add synthetic data to increase coverage and tackle edge cases. There’s a risk that adding too much synthetic data creates its own testing and maintenance burden.

For On-Premise Requirements

If you are working with private or sensitive data, consider an on-premise AI platform like Helix.ml, which allows you to run AI workloads entirely on your own hardware or in private data centres.

Key Takeaways

  1. Non-determinism is inherent — you cannot eliminate it, only manage it
  2. Parameters help but aren’t enough — temperature, top_p, top_k, and seeds provide control but not determinism across environments
  3. Prompt engineering reduces hallucination — instruct uncertainty acknowledgment, request verification, use citations
  4. Evaluation is essential — tools like PromptFoo enable systematic testing and comparison
  5. Synthetic data accelerates development — generate diverse test cases to identify weak spots in your prompts
  6. Iterate based on data — use evaluation results to guide prompt improvements rather than hacking blindly
  7. Start simple, add complexity as needed — avoid over-engineering tests that become maintenance burdens
