RAG vs Fine-Tuning in 2026: A Decision Framework for LLM Teams
by Dr. Phil Winder, CEO
RAG or fine-tuning? Most LLM applications start with RAG, then add fine-tuning or custom models as an optimisation or for very specific use cases. Retrieval-augmented generation (RAG) handles knowledge that changes over time, whereas fine-tuning handles behaviour that should not change. The best production implementations combine both. This article gives you the decision tree, the comparison table, and some example tooling to help you choose well.
Below is the framework we use at Winder.AI when scoping LLM engagements.
The 2026 decision at a glance
| Approach | When it wins | Typical cost | Time to first result | Best for | Main weakness |
|---|---|---|---|---|---|
| Pure prompt engineering | Single-use queries, prototypes | Negligible | Hours | Quick demos, exploration | No persistence; cannot teach new behaviour |
| RAG (retrieval-augmented generation) | Knowledge changes; need citations | £5,000 to £40,000 build | 1 to 3 weeks | Q&A over docs, support, search | Quality capped by retrieval quality |
| Supervised fine-tuning (full or LoRA) | Fixed style, schema, or narrow skill | £10,000 to £60,000 plus data | 4 to 8 weeks | Tone, format, specialised tasks | Stale facts; hard to update |
| Hybrid (fine-tuned base + RAG) | Production at scale with strict format and live data | £30,000 to £120,000 | 6 to 12 weeks | Regulated workflows, customer-facing assistants | Two systems to maintain |
| Continued pre-training | Domain language model from scratch | £100,000 to £1m+ | Months | Pharma, legal, or scientific vocab gap | Rarely justified vs RAG plus fine-tune |
Ranges are mid-market 2026, assuming a clean-enough data baseline. Costs widen at the extremes for highly regulated or research-grade work.
The decision tree
Walk this top-to-bottom. Stop at the first “yes”.
1. Does the answer depend on data that changes (policies, prices, product specs, tickets, code)? → RAG. Fine-tuning bakes the data into weights and goes stale the moment the data updates. Use RAG so you can replace documents without retraining.
2. Do you need to cite sources, show provenance, or pass an audit? → RAG. Fine-tuned models cannot point at the document that justified an answer. Retrieval can.
3. Is the problem a fixed output schema (strict JSON, regulatory form, structured extraction) and prompting alone is unreliable? → Fine-tune. A few thousand examples of input → desired output collapse the variance that prompting cannot fully eliminate.
4. Do you need a consistent tone, persona, or domain terminology where prompting alone won’t help? → Fine-tune. Style is a behaviour, not a fact.
5. Do you need a small open-source model to match a larger frontier model on one narrow task, for cost or latency? → Fine-tune. This is the strongest commercial case for fine-tuning in 2026: distil a frontier model into a tuned small open-weight model and cut inference cost by an order of magnitude on a task the base model already half-handles.
6. Otherwise → RAG, then revisit fine-tuning only for the specific behaviours retrieval cannot fix.
Most production systems land at step 6 and then loop back to step 3 or 4 once the RAG baseline exposes the residual behaviours that need locking down.
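The tree above can be written down as a first-match rule list. The sketch below is only an illustration of the "stop at the first yes" logic; the `UseCase` field names and the `recommend` function are hypothetical and not part of any library.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Answers to the decision-tree questions (hypothetical field names)."""
    data_changes: bool = False              # policies, prices, specs, tickets, code
    needs_citations: bool = False           # provenance or audit requirements
    fixed_schema_unreliable: bool = False   # strict format, prompting alone unreliable
    needs_consistent_style: bool = False    # tone, persona, domain terminology
    needs_small_model_parity: bool = False  # narrow task, driven by cost or latency


def recommend(case: UseCase) -> str:
    """Walk the questions in order and return at the first 'yes'."""
    questions = [
        (case.data_changes, "RAG"),
        (case.needs_citations, "RAG"),
        (case.fixed_schema_unreliable, "Fine-tune"),
        (case.needs_consistent_style, "Fine-tune"),
        (case.needs_small_model_parity, "Fine-tune"),
    ]
    for answer, recommendation in questions:
        if answer:
            return recommendation
    # No question matched: default to RAG and revisit fine-tuning later.
    return "RAG, then revisit fine-tuning"


# Changing data is asked first, so it wins even when a schema issue also exists.
print(recommend(UseCase(data_changes=True, fixed_schema_unreliable=True)))
```

Because the order encodes priority, a use case with both changing data and a strict schema resolves to RAG first, which matches the loop described above: build the RAG baseline, then return to the fine-tuning questions for whatever remains.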
What is RAG?
RAG uses off-the-shelf models (language and ranking) from vendors or the open-source community. At query time, your code retrieves the most relevant parts of your information, stuffs them into the prompt, and asks the model to answer using that context. The model contributes language and reasoning; your data contribute facts.
The advantages stack up:
- Update facts by updating documents. No retraining cycle.
- Swap base models freely. Move from GPT-4o to Claude Sonnet to a self-hosted Llama without re-doing the work.
- Cite sources. Every answer points at the chunks that justified it.
- Cheap to iterate. A bad retrieval result is fixed by improving chunking, embeddings, or reranking, not by retraining.
The weakness is real and worth naming: RAG quality is capped by retrieval quality. If the retriever cannot find the right chunk, the model cannot answer. What’s worse, retrieving the wrong chunk can steer the model into stating an incorrect or invalid answer. Most “RAG does not work” stories are retrieval problems masquerading as generation problems.
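To make the retrieve-then-prompt loop concrete, here is a minimal sketch in plain Python. It assumes a toy word-count "embedding" as a stand-in for a real embedding model, and the chunks, `embed`, `retrieve`, and `build_prompt` are all hypothetical names for illustration. The final call to a language model is deliberately left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Stuff the retrieved chunks into the prompt, numbered so they can be cited."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(context, start=1))
    return (
        "Answer using only the context below and cite chunk numbers.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {query}"
    )


# Hypothetical knowledge base: updating a fact means editing a string here.
CHUNKS = [
    "The refund window is 30 days from purchase.",
    "Support is open Monday to Friday.",
    "Shipping to the EU takes 3 to 5 working days.",
]

question = "What is the refund window?"
top = retrieve(question, CHUNKS, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The word-count similarity is fragile by design: rephrase the question without the words "refund" or "window" and the wrong chunk may rank first, which is the retrieval failure described above in miniature.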
A production RAG stack in 2026
A working 2026 RAG pipeline has at minimum:
- Chunking: a strategy that matches the structure of your information.
- Embeddings: a model that best represents your content. Popular vendor embeddings don’t work particularly well with domain-specific content such as code.
- Vector store: pgvector or pgvectorscale for PostgreSQL fans. Dedicated vendor options if you want personal or professional support.
- Reranking: a cross-encoder (Cohere Rerank, bge-reranker) over the top 20 to 50 retrieved chunks. This step is the single biggest quality lever and the one most teams skip.
- Orchestration: roll your own, use an agentic harness or pick a library.
- Evaluation: Ragas, TruLens, or DeepEval for automatic eval; a small labelled gold set for regression.
You can skip the reranker, the evals, and most of the orchestration for a quick demo. Just know what you are skipping: most production systems that “don’t work” are missing one of these. We have written extensively about LLM application architecture and frameworks if you want to go deeper on the stack itself.
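As an illustration of what a small labelled gold set buys you, the sketch below scores retrieval alone with two common metrics, recall@k and mean reciprocal rank. The gold set, chunk IDs, and function names are hypothetical; in a real pipeline the `retrieved` lists would come from running your retriever, and a tool such as Ragas or DeepEval would add generation metrics on top.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical gold set: labelled relevant chunks plus what the retriever returned.
GOLD = [
    {"relevant": {"c1"}, "retrieved": ["c1", "c7", "c3"]},
    {"relevant": {"c4", "c5"}, "retrieved": ["c9", "c4", "c2"]},
    {"relevant": {"c8"}, "retrieved": ["c2", "c3", "c6"]},
]

mean_recall = sum(recall_at_k(g["retrieved"], g["relevant"], k=3) for g in GOLD) / len(GOLD)
mrr = sum(reciprocal_rank(g["retrieved"], g["relevant"]) for g in GOLD) / len(GOLD)
print(f"recall@3={mean_recall:.2f} MRR={mrr:.2f}")
```

Tracking these two numbers across changes to chunking, embeddings, or reranking gives a regression signal that does not depend on the generator at all.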
When fine-tuning is the right call
Two cases, in the order they win the commercial argument. Everything else people fine-tune for is better solved by retrieval, a stronger prompt, or a better base model.
1. Distillation for cost and latency. This is the strongest 2026 case for fine-tuning, and the one most teams overlook. Use a frontier model (GPT-5, Claude Sonnet 4.6) to generate high-quality outputs on your specific task, then fine-tune a small open-source model (Llama, Qwen, Mistral, etc.) on those outputs. The tuned smaller model delivers near-frontier quality on the narrow task at roughly one-tenth the inference cost and a fraction of the latency. At production volume, the savings dwarf the one-off training cost in weeks. This is where the unit economics of LLM applications get genuinely better.
2. Style, tone, and output structure. When prompting drifts and you need consistent behaviour every time: a customer service assistant that must sound a specific way, a clinical summariser that must adopt a clinical register, a structured-output generator that must hit a fixed JSON schema. A few hundred to a few thousand curated examples (LoRA needs fewer than people think) remove the residual variance that prompting plus retries cannot. Worth doing when the cost of an off-style output is high (regulatory, brand, audit).
What fine-tuning does not fix: stale knowledge, missing context, hallucinated facts about your business. Those belong in retrieval, not weights.
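The distillation argument is ultimately arithmetic, so a back-of-envelope break-even calculation is worth sketching. Every number below is a hypothetical placeholder (training cost, request volume, per-request price), and the calculation deliberately ignores hosting and maintenance costs for the smaller model, which you would need to add for a real business case.

```python
def breakeven_days(
    training_cost: float,
    requests_per_day: float,
    frontier_cost_per_request: float,
    cost_ratio: float = 0.1,
) -> float:
    """Days until cumulative inference savings cover the one-off training cost.

    cost_ratio is the small model's per-request cost as a fraction of the
    frontier model's (0.1 corresponds to "roughly one-tenth").
    """
    small_cost_per_request = frontier_cost_per_request * cost_ratio
    daily_saving = requests_per_day * (frontier_cost_per_request - small_cost_per_request)
    if daily_saving <= 0:
        raise ValueError("No saving: the small model is not cheaper per request.")
    return training_cost / daily_saving


# Hypothetical inputs: 20,000 one-off cost, 50,000 requests/day, 0.02 per frontier request.
days = breakeven_days(
    training_cost=20_000,
    requests_per_day=50_000,
    frontier_cost_per_request=0.02,
)
print(f"Break-even after about {days:.1f} days")
```

With these assumed inputs the saving is 900 per day, so the training cost is recovered in roughly three weeks. At a tenth of that request volume the same project takes most of a year to pay back, which is why volume is the first number to check.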
Fine-tuning tooling in 2026
Self-hosted (open-weight models)
| Goal | Tool |
|---|---|
| Cheap, fast LoRA/QLoRA on one GPU (lowest VRAM) | Unsloth |
| Same, but GUI-first with the widest model coverage | LLaMA-Factory |
| Reproducible, config-driven, production training | Axolotl (YAML + FSDP2/DeepSpeed) |
| Distributed full fine-tune at scale | Axolotl or LLaMA-Factory over FSDP/DeepSpeed; TRL for custom loops |
| RL / reasoning fine-tune (DeepSeek-style) | GRPO via TRL, Unsloth, or Axolotl |
| Preference-based fine-tune | DPO via TRL |
Managed (proprietary models)
| Goal | Tool |
|---|---|
| Fine-tune Claude | AWS Bedrock (Claude 3 Haiku only, us-west-2) |
| Fine-tune Gemini | Vertex AI (SFT and preference tuning, Gemini 2.5 family) |
| Fine-tune OpenAI models | OpenAI Fine-tuning API: supervised, DPO, and reinforcement fine-tuning on GPT-4o-mini and GPT-4.1. Check current model availability before committing. |
Always
| Goal | Tool |
|---|---|
| Evaluation | Held-out test set plus task-specific metrics (not just loss) |
LoRA and QLoRA train a tiny adapter (often less than 1% of model parameters) instead of the full model. This is cheap, fast, and reversible: you can stack adapters or throw them away. For the bulk of open-weight fine-tuning, LoRA is the right tool. The bar for needing any fine-tune keeps rising, since current base models handle, out of the box, much of what required tuning 18 months ago.
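The "less than 1%" figure follows from how LoRA is built: instead of updating a weight matrix of shape d_out × d_in, it trains two low-rank factors of shape d_out × r and r × d_in. The sketch below counts the parameters for an assumed model shape (hidden size 4096, 32 layers, four square attention projections per layer, rank 16, roughly 7B base parameters). These dimensions are illustrative assumptions, not a description of any specific model.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_in + d_out)


# Assumed, illustrative model shape.
HIDDEN = 4096
LAYERS = 32
TARGETS_PER_LAYER = 4  # e.g. query, key, value, and output projections
RANK = 16
BASE_PARAMS = 7_000_000_000

adapter_params = LAYERS * TARGETS_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)
fraction = adapter_params / BASE_PARAMS

print(f"Adapter parameters: {adapter_params:,}")
print(f"Share of base model: {fraction:.2%}")
```

Under these assumptions the adapter has about 16.8 million trainable parameters, around a quarter of one percent of the base model. Doubling the rank or targeting more matrices scales the count linearly, so the adapter stays small across typical settings.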
The hybrid pattern that wins in production
Most production systems we deliver end up as a hybrid:
- Build RAG first. Prove the use case. Collect, curate, and measure quality with an evaluation set.
- Identify residual failure modes and attempt to fix them through prompt engineering and pipeline tuning. Anything that still cannot be fixed becomes a candidate for fine-tuning: schema drift, tone inconsistency, format errors, the long tail where prompting just cannot get to 100%.
- Fine-tune a smaller, cheaper base model to handle those residuals.
- Run the fine-tuned model inside the same RAG pipeline.
This stacks the strengths: live facts from retrieval, locked behaviour from fine-tuning, lower inference cost than a frontier model would charge for the same job. It is also the pattern that survives base-model upgrades best, because the RAG layer is model-agnostic and the fine-tuned adapter can be retrained on the new base with a small fraction of the original effort.
For an opinionated view of how this maps to actual engagements, see our LLM consulting and development, AI agent development, and generative AI consulting service pages.
Common mistakes we see in 2026
- Fine-tuning to fix knowledge gaps. Fine-tuning is not a search index. The facts go stale, and you cannot cite them. Use RAG.
- Vector-only top-k with no reranking. Plain cosine-similarity top-k is the largest preventable quality cap in production RAG. Use hybrid (lexical + vector) retrieval, then a cross-encoder reranker over the top 20–50 candidates. Watch the reranker’s token limit, as most cross-encoders silently truncate at 512 tokens. Use a long-context reranker if your chunks are bigger.
- Naive fixed-size chunking. Splitting every 500 tokens regardless of structure shreds context. Use semantic or hierarchical chunking, and consider contextual retrieval (prepend a generated summary per chunk) or a small-to-big / parent-document pattern.
- No eval harness. Without a labelled gold set and an automatic metric, you cannot tell whether a change helped or regressed. Use Ragas for metric design and DeepEval for CI/CD quality gates, and set both up in week one. Measure retrieval and generation separately; most failures are retrieval, not generation.
- Full fine-tunes when LoRA would do. Full fine-tunes are slower, more expensive, and harder to roll back. Start with LoRA.
- Assuming you need thousands of examples. The old “1,000 minimum” rule no longer holds with LoRA/QLoRA. For classification and extraction, 200–500 curated examples are often enough; quality beats quantity, and many teams stall chasing dataset sizes that parameter-efficient methods don’t require.
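One common way to combine lexical and vector results before reranking is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one option among several. The sketch assumes you already have two ranked lists of chunk IDs; the IDs are hypothetical, and the constant `k=60` is the value conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a lexical (e.g. BM25) search and a vector search.
lexical_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]

fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
print(fused)  # ['a', 'c', 'b', 'd']
```

Chunks found by both retrievers rise to the top, while chunks found by only one are kept rather than dropped. The fused list, cut to the top 20–50 candidates, is what you would then hand to the cross-encoder reranker.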
The takeaway
In 2026, RAG is the default for LLM applications. Fine-tuning is the right tool for two things: distilling frontier-model performance into a smaller, cheaper model for cost and latency, and locking in style, tone, or output structure that prompting cannot hold. It is almost never the right tool for facts. The production systems that work best are hybrids: RAG for what changes, fine-tuning for what should not.
If you would like an opinionated, scope-specific recommendation on whether your use case calls for RAG, fine-tuning, or both, book a no-obligation scoping call. We will tell you the right shape for what you are actually trying to do.
Frequently asked questions
Default to RAG. In 2026, retrieval-augmented generation (RAG) is the correct first choice for roughly 80% of enterprise LLM applications because it lets you change the source data without retraining, attribute answers to documents, and switch base models freely. Fine-tune only when you need to distil a frontier model into a smaller cheaper one for cost and latency, or to lock in a fixed style, tone, or output schema that prompting cannot hold. Most production systems end up as a hybrid: RAG for facts, light fine-tuning for tone and format.
RAG keeps the model the same and changes what the model sees: it retrieves relevant chunks from your data at query time and stuffs them into the prompt. Fine-tuning changes the model weights themselves using your examples, so the new behaviour is baked in. RAG is for knowledge that changes; fine-tuning is for behaviour that should not change.
Fine-tune for two main reasons. First, distillation: tune a small open-source model on outputs from a frontier model so you can match near-frontier quality on a narrow task at roughly one-tenth the inference cost. Second, style and structure: tone, persona, a fixed JSON schema, a regulatory format that prompting cannot hold consistently. Everything else (facts, search, current data) belongs in retrieval, not in weights.
A minimum-viable RAG pipeline in 2026 uses an embedding model (OpenAI text-embedding-3-large, Cohere embed-v3, or open-source bge-large), a vector store (pgvector, Qdrant, Weaviate, or Pinecone), a chunking strategy (semantic or hierarchical, not naive fixed-size), a reranker (Cohere Rerank or a cross-encoder) for the top results, and an evaluation harness (Ragas, TruLens, or DeepEval). The orchestration layer is usually LangChain or LlamaIndex. Skipping reranking and evals is the most common reason RAG systems underperform.
For supervised fine-tuning of open-source models, use Hugging Face PEFT with LoRA or QLoRA: it trains a tiny adapter rather than the full model, which is cheap and reversible. For OpenAI models, use the OpenAI fine-tuning API for GPT-4o-mini and GPT-4.1 (supervised, DPO, and reinforcement fine-tuning are all supported in 2026). For Anthropic Claude, fine-tuning is available via Bedrock for Claude Haiku. For evaluation, use a held-out test set and at least one task-specific metric, not just loss.
A LoRA fine-tune of a 7B to 13B open-source model on a single A100 typically costs US$50 to US$500 in compute for a few thousand examples. An OpenAI GPT-4o-mini fine-tune is usually US$10 to US$200 in training cost depending on dataset size. The hidden cost is data: producing a clean, labelled, deduplicated training set of even 1,000 examples often takes a senior engineer one to two weeks, which is far more expensive than the GPU time.
Yes, and this is the pattern that wins in production. The common shape is: fine-tune a smaller open-source model on the output style, schema, or domain reasoning you need, then wrap it with RAG so it answers from your current knowledge base. This combines a cheap, consistent base behaviour with up-to-date facts. The order matters: build RAG first, prove the use case, then fine-tune only the parts that retrieval cannot fix.