AI Confidence: Evaluating AI Models with Synthetic Data

by Dr. Phil Winder, CEO

Building AI-powered applications is exciting, until you need to evaluate whether your model actually works. How do you test an AI system that gives different answers every time? How do you gather enough test data to be confident in your results? This one-hour practical webinar will transform you from uncertain to confident about AI evaluation.


Overview

I think the hardest part of building AI applications is testing and evaluation. When so much of your application depends on the underlying responses of an LLM, how can you be sure that it’s working?

And it’s far too easy to hack away, tweaking a prompt here or optimizing an LLM call there, thinking you’re improving your application. But how do you prove you’re helping, not hindering?

This content is based upon a day-long workshop I delivered at GOTO Copenhagen 2025. It covers the fundamental shift in how we approach testing when working with non-deterministic AI systems.

The Core Problem: Non-Determinism in AI

Traditional software engineering relies on deterministic tests with clear inputs, outputs, and assertions. Some approximation of test-driven development (TDD) has been the standard approach for decades.

But AI-driven systems fundamentally change this paradigm. LLMs are fuzzy black boxes. When you pass something in, you can’t be certain what you’ll get out.

A simple demonstration: running the same prompt (“Explain why Paris is an important city in 10-15 words”) ten times through a local Ollama instance with the default settings produces subtly different results each time.
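
A minimal sketch of that demonstration, assuming the ollama Python package, a running local Ollama server, and a pulled qwen3:1.7b model (any local model will do):

import ollama

PROMPT = "Explain why Paris is an important city in 10-15 words"

# Run the same prompt ten times with default settings; expect subtly different answers.
for i in range(10):
    response = ollama.generate(model="qwen3:1.7b", prompt=PROMPT)
    print(f"Run {i + 1}: {response['response'].strip()}")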

Controlling Output Variability

There are parameters under your control that increase the level of determinism; a sketch showing how to set them follows the parameter list below.

Key Parameters for Controlling Output

Temperature:

  • Controls randomness in model output
  • Lower values (0.1-0.3) = more deterministic, consistent responses
  • Higher values (1.0+) = more creative, varied responses
  • Think of it as flattening or sharpening the probability distribution of next-token selection

Top P (Nucleus Sampling):

  • Limits the model to the most probable tokens whose cumulative probability mass reaches P
  • Low values (0.1) = very consistent, only most probable tokens
  • High values (0.95+) = allows more diversity, can lead to hallucination or language switching
  • Warning: High top_p with multilingual models (like Qwen) can cause unexpected language switches

Top K:

  • Selects only the top K most probable tokens
  • More predictable than top_p since it’s a fixed count
  • Top K of 2 means only the two most likely tokens can be chosen

Repeat Penalty:

  • Penalises tokens that have already appeared, controlling how much the output repeats itself
  • Values around 0.75 allow repetition
  • Values around 1.5 discourage repetition
  • Higher values produce more concise outputs

Num Predict / Max Tokens:

  • Limits response length
  • Particularly useful for classification tasks where you expect short outputs (e.g., “true” or “false”)
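
A sketch of how these options might be passed to a local model through the Ollama Python client; the keys are Ollama’s option names, and other providers expose equivalents under slightly different names:

import ollama

# Options tuned for a factual/classification-style task (Ollama option names).
options = {
    "temperature": 0.2,     # low temperature for consistent answers
    "top_p": 0.4,           # only sample from the most probable mass
    "top_k": 20,            # consider at most the 20 most likely tokens
    "repeat_penalty": 1.1,  # mildly discourage repetition
    "num_predict": 5,       # cap output length, e.g. for "true"/"false" answers
}

response = ollama.generate(
    model="qwen3:1.7b",
    prompt="Is Paris the capital of France? Answer true or false.",
    options=options,
)
print(response["response"].strip())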

Setting the Random Seed Is Not Enough

You can set a random number generator seed in most APIs to make outputs reproducible on the same system. However, seeds alone aren’t sufficient because:

  • Different hardware architectures produce different results
  • CPU vs GPU vs TPU inference runtimes vary
  • Floating-point precision differences in chipsets cause cascading effects on model parameters
  • Switching model providers (e.g. OpenAI vs. Azure OpenAI) exposes implementation differences
  • Running in Docker vs. locally counts as an architecture change and affects results
  • Switching frameworks (same model!) produces different results

Bottom line: Even with temperature=0, top_k fixed, and seed set, you cannot guarantee identical outputs across different environments.
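
For example, fixing the seed alongside a zero temperature makes repeated runs match on one machine, but the caveats above still apply the moment the environment changes (a sketch using the Ollama Python client):

import ollama

# A fixed seed plus temperature 0 gives reproducible runs on the same machine only.
options = {"temperature": 0, "seed": 42}

outputs = {
    ollama.generate(
        model="qwen3:1.7b",
        prompt="Explain why Paris is an important city in 10-15 words",
        options=options,
    )["response"]
    for _ in range(5)
}
# Expect one distinct output here; a different machine, runtime, or framework may not agree.
print(f"{len(outputs)} distinct output(s)")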

Guidelines by Task Type

Factual/Analytical Tasks (classification, Q&A, consistent results needed):

  • Temperature: 0.1-0.3
  • Top P/K: 0.3-0.5
  • Focus on consistency

Creative Tasks (text generation, content creation):

  • Temperature: near 1.0
  • Top P: higher values, but be careful not to introduce hallucination
  • Allow flexibility, but monitor for quality degradation

Balanced Tasks (coding assistance, some creativity with data-backed approach):

  • Temperature: 0.4-0.5
  • Top P: around 0.5
  • Balance creativity and reliability
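
One way to apply these guidelines is a small, hypothetical table of presets that your application selects from per task; the numbers simply mirror the list above and are a starting point, not a rule:

# Hypothetical sampling presets mirroring the guidelines above (Ollama option names).
SAMPLING_PRESETS = {
    "factual": {"temperature": 0.2, "top_p": 0.4},
    "creative": {"temperature": 1.0, "top_p": 0.9},
    "balanced": {"temperature": 0.45, "top_p": 0.5},
}

def options_for(task_type: str) -> dict:
    """Return sampling options for a task type, defaulting to the balanced preset."""
    return SAMPLING_PRESETS.get(task_type, SAMPLING_PRESETS["balanced"])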

Context Length Considerations

  • Models have maximum context length limits for inputs and outputs
  • Self-hosted models: context length dramatically affects memory usage
  • Recommendation: Minimise prompts and expected outputs where possible
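
For self-hosted models served through Ollama, the context window is itself a per-request option (num_ctx); keeping it no larger than the task needs saves memory. For example:

import ollama

# num_ctx sets the context window in tokens for this request; larger windows use
# more memory on self-hosted hardware, so keep it as small as the task allows.
response = ollama.generate(
    model="qwen3:1.7b",
    prompt="Classify this email as urgent or not urgent: <email text here>",
    options={"num_ctx": 2048, "num_predict": 5},
)
print(response["response"].strip())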

Managing Hallucination

Hallucination isn’t purely negative. It’s the other side of the creativity coin. We can’t simultaneously demand “don’t hallucinate” and “be creative.” Evaluation is about finding the right balance.

Models hallucinate when asked to do things they have no knowledge of or lack context for. Example: Asking about the fictional “Dr Samantha Rodriguez, Nobel Prize winner in AI for 2025” can produce confident but fabricated biographical details.

Techniques to Reduce Hallucination

1. Instruct the model to acknowledge uncertainty: Add simple instructions like “If you don’t know, say so.” This is a no-brainer addition to virtually every prompt.

2. Ask the model to verify its own statements: Have the model apply verification before presenting answers. This guides output prediction toward correct answers.

3. Leverage thinking models: Reasoning models implicitly think before responding, which improves accuracy. The output tokens act as a guide through a maze toward the correct answer.

4. Include citations:

  • Especially valuable with RAG systems or external information retrieval
  • Ask the model to write out citations/verified facts first, then generate the answer
  • Even labelling things as “fact” helps ground responses

5. Chain of thought prompting: An extension of the verification approach — have the model reason step-by-step.

6. Appropriate temperature settings: Lower temperatures for factual tasks where consistency matters.
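
Several of these techniques can be combined in a single prompt. A sketch, with illustrative wording rather than a canonical template:

import ollama

# An illustrative system prompt combining uncertainty acknowledgement,
# citation-first grounding, and self-verification.
SYSTEM_PROMPT = """You are a careful assistant.
- If you do not know the answer, say "I don't know" rather than guessing.
- Before answering, list the facts you are relying on, each labelled FACT,
  with a citation if one was provided in the context.
- Verify each fact against the question before writing your answer.
- Keep the final answer short and grounded only in the listed facts."""

response = ollama.chat(
    model="qwen3:1.7b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Who won the Nobel Prize in AI in 2025?"},
    ],
    options={"temperature": 0.2},
)
print(response["message"]["content"])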

Evaluation Methods

Benchmarks

Benchmarks are curated datasets with scripts to quantify LLM performance. They typically:

  • Contain question-answer pairs with expected outputs
  • Use another LLM as a judge to determine correctness (see the sketch below)
  • Measure results quantitatively
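
A minimal sketch of the model-as-judge idea from the list above, assuming the Ollama Python client and whichever local model you trust as a judge:

import ollama

def judge(question: str, expected: str, actual: str) -> bool:
    """Ask a judge model whether the actual answer matches the expected answer."""
    prompt = (
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Model answer: {actual}\n"
        "Does the model answer convey the same meaning as the expected answer? "
        "Reply with exactly YES or NO."
    )
    reply = ollama.generate(
        model="qwen3:1.7b",
        prompt=prompt,
        options={"temperature": 0},
    )["response"]
    # Crude parsing of the verdict; a production judge would ask for structured output.
    return "YES" in reply.upper()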

Human evaluation is another form: websites exist where people compare qualitative results from various models.

Evaluation Frameworks: PromptFoo

PromptFoo is a powerful Node-based evaluation framework that provides:

  • A YAML configuration format for defining tests (CI-friendly)
  • A web UI for viewing results
  • Support for multiple providers, prompts, and test cases
  • Both deterministic and LLM-powered assertions

GitHub: https://github.com/promptfoo/promptfoo

Setting Up PromptFoo

Install globally via npm:

npm install -g promptfoo

YAML Configuration Structure

providers:
  - ollama:qwen3:1.7b  # List of models to test

prompts:
  - "Your prompt with {{variable}} placeholders"

tests:
  - vars:
      variable: "test input value"
    assert:
      - type: contains
        value: "expected_word"

Assertion Types

Deterministic assertions:

  • equals — exact match
  • contains — substring match
  • contains-json — valid JSON check
  • Custom JavaScript functions for complex logic

LLM-powered assertions:

  • Factuality checks
  • Semantic similarity
  • Custom model-as-judge evaluations

Running and Viewing Results

promptfoo eval              # Run evaluations
promptfoo view              # Launch web UI to view results

The UI shows:

  • Pass/fail status for each test
  • Variable inputs and model outputs
  • Comparison across different prompts/providers
  • Historical run comparisons
  • Filters for passes/failures

Synthetic Data Generation

Why Generate Synthetic Data?

Problem 1: Test coverage burden. Applications with diverse use cases require many test cases. Manual creation is time-consuming.

Problem 2: PII concerns. Real data often contains personally identifiable information that shouldn’t be in tests.

Approaches to Generating Synthetic Data

Simple approach: Use your favourite chat LLM provider to generate test data manually. Copy it into the right format for PromptFoo.

Script-based approach: Write a script that generates data based on predefined categories/seeds.

Example workflow for email classification (a sketch of the generation script follows the list):

  1. Define classification categories (urgent, supply, schedule, customer, maintenance, other)
  2. Create seed descriptions for each category
  3. Have an LLM generate realistic emails for each category
  4. Output in PromptFoo-compatible JSON format
  5. Assert that classifications match expected categories
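
A sketch of such a generation script, using the Ollama Python client to draft the emails and writing a JSON test file; the vars/assert structure mirrors the PromptFoo YAML example above, but check PromptFoo’s documentation for the exact external-test format it expects:

import json
import ollama

# Seed descriptions for each classification category.
CATEGORIES = {
    "urgent": "a production outage that needs immediate attention",
    "supply": "a question about stock levels or deliveries",
    "schedule": "a request to arrange or move a meeting",
    "customer": "a customer complaint or piece of feedback",
    "maintenance": "a notification about planned maintenance work",
    "other": "anything that fits none of the categories above",
}

tests = []
for category, description in CATEGORIES.items():
    for _ in range(3):  # a few synthetic emails per category
        email = ollama.generate(
            model="qwen3:1.7b",
            prompt=f"Write a short, realistic workplace email about {description}. "
                   "Return only the email body.",
            options={"temperature": 0.9},  # higher temperature for varied examples
        )["response"].strip()
        tests.append({
            "vars": {"email": email},
            "assert": [{"type": "contains", "value": category}],
        })

with open("synthetic_emails.json", "w") as f:
    json.dump(tests, f, indent=2)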

Using Synthetic Data in Evaluation

Configure PromptFoo to read test cases from a JSON file:

tests: synthetic_emails.json

This allows rapid iteration during prompt development, revealing which categories perform well or poorly.

Example finding: Running classification tests might reveal that “maintenance” and “other” categories perform poorly while “urgent,” “supply,” and “customer” work well. This directs prompt refinement efforts and might even suggest revisiting the classification taxonomy itself.

Practical Recommendations

Start Small: Gain confidence with smaller datasets and simpler “happy-path” end-to-end examples. Wide coverage is better than edge cases at this point.

Build Up Gradually: To begin with, review and write examples by hand if you can; this builds your domain expertise. Then start adding edge cases and refining your prompts.

When to Add Real Data: Once you’re happy with your manual prompts, incorporate real data if you have it. Avoid adding too many tests, which only adds to the maintenance burden.

When to Add Synthetic Data: Once you’re happy with your manual prompts, add synthetic data to increase coverage and tackle edge cases. There’s a risk that adding too much synthetic data creates its own testing and maintenance burden.

For On-Premise Requirements

If you are working with private or sensitive data, consider an on-premise AI platform like Helix.ml, which allows you to run AI workloads entirely on your own hardware or in private data centres.

Key Takeaways

  1. Non-determinism is inherent — you cannot eliminate it, only manage it
  2. Parameters help but aren’t enough — temperature, top_p, top_k, and seeds provide control but not determinism across environments
  3. Prompt engineering reduces hallucination — instruct uncertainty acknowledgment, request verification, use citations
  4. Evaluation is essential — tools like PromptFoo enable systematic testing and comparison
  5. Synthetic data accelerates development — generate diverse test cases to identify weak spots in your prompts
  6. Iterate based on data — use evaluation results to guide prompt improvements rather than hacking blindly
  7. Start simple, add complexity as needed — avoid over-engineering tests that become maintenance burdens
