
Fine-Tuning vs Prompt Engineering vs RAG: A Decision Framework That Actually Helps


My team spent three months fine-tuning a model. Turns out a well-written prompt did the same thing. We don't talk about Q3.

"Should we fine-tune, use RAG, or just prompt engineer?" I hear this weekly. The question usually means the team hasn't defined what they actually need — these three solve fundamentally different problems.

Here's the framework I use. It's not complicated, which is the point.

What is your model actually failing at?

If it behaves correctly but doesn't have access to your data — RAG. If it understands your data but doesn't follow your domain conventions — possibly fine-tuning. If it's mostly working but the output isn't quite right — better prompts.
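The tree above is small enough to write down literally. Here's a minimal sketch — the function name and the three boolean flags are my own illustration, not anything from a real library:

```python
def choose_approach(behaves_correctly: bool,
                    knows_your_data: bool,
                    follows_conventions: bool) -> str:
    """Illustrative sketch of the decision tree: RAG for missing data,
    fine-tuning (maybe) for missing conventions, prompts for everything else."""
    if behaves_correctly and not knows_your_data:
        return "RAG"
    if knows_your_data and not follows_conventions:
        return "fine-tuning (maybe)"
    return "better prompts"
```

The point isn't the code, it's that the branch conditions are about *failure modes*, not about which technology sounds most impressive.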

That's the decision tree. Everything else is detail.

Prompt engineering first. Always.

Teams skip this constantly. They jump to RAG or fine-tuning without spending serious time on prompts, then wonder why things aren't better.

Good prompt engineering isn't "write a longer system prompt." It's structured few-shot examples, chain-of-thought, output constraints, task decomposition. Unglamorous. Solves about 70% of problems people think require fine-tuning.
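To make "structured few-shot examples plus output constraints" concrete, here's a sketch of what that looks like as code. The classification task and the example messages are invented for illustration; the structure is the point:

```python
# Illustrative few-shot prompt builder: explicit output constraint up front,
# worked examples in a fixed format, then the real query in the same format.
FEW_SHOT = [
    ("Refund request, order arrived damaged", "category: refund"),
    ("Where is my parcel? Ordered last week", "category: shipping"),
]

def build_prompt(query: str) -> str:
    lines = [
        "Classify the support message. Respond with exactly one line",
        "in the form 'category: <refund|shipping|other>'.",
        "",
    ]
    for message, label in FEW_SHOT:
        lines.append(f"Message: {message}")
        lines.append(f"Answer: {label}")
        lines.append("")
    lines.append(f"Message: {query}")
    lines.append("Answer:")
    return "\n".join(lines)
```

Unglamorous, as promised. But a constrained output format plus two or three good examples often does more than a paragraph of free-form instruction.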

The catch: prompts are fragile. They break when models update, consume tokens every request, and there's a hard ceiling — if the model doesn't know something, no prompt fixes that.

RAG is the right answer more often than you think

If your problem is "the model doesn't know about our data," don't retrain it. Hand it the documents at inference time.

I've watched teams spend months fine-tuning on their knowledge base when a RAG pipeline would've given better results in weeks. Fine-tuning bakes knowledge into weights, which sounds great until that knowledge changes next quarter. RAG keeps it external and updateable.

The part nobody warns you about: RAG is harder than the tutorials suggest. Chunking strategy matters enormously. Embedding quality is your ceiling. Retrieval relevance is the actual bottleneck, not generation. If retrieval returns irrelevant docs, the model will confidently synthesise nonsense from them.

What actually moves the needle: chunk size and overlap (too small loses context, too large dilutes relevance), hybrid search combining vectors with BM25, and a re-ranker after initial retrieval. Also — measure retrieval recall separately from generation quality. Almost nobody does this.
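Two of those points fit in a few lines of code. Below is a naive fixed-size chunker with overlap (a baseline only — real pipelines usually chunk on semantic boundaries) and a recall@k function for measuring retrieval independently of generation. Both are illustrative sketches, not from any particular library:

```python
def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Fixed-size character chunks with overlap. Naive baseline:
    too small loses context, too large dilutes relevance."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top-k results.
    Measure this on a labelled query set, separately from answer quality."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0
```

If recall@k is low, no amount of prompt tuning on the generation side will save you — the model never saw the right documents.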

Fine-tuning is the last resort, not the first

Fine-tuning changes behaviour at the weight level. It's appropriate when you need domain-specific conventions or tasks that prompts can't specify well. It is not appropriate when you just need the model to know your data. That's RAG.

You need good training data — not just volume, quality. I've seen teams fine-tune on thousands of inconsistent, poorly labelled examples. Garbage in, garbage out, except now the garbage is baked into weights and you've spent £40K on GPU time.
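One cheap check worth running before any GPU time: look for the same input labelled two different ways. A hypothetical sketch, assuming training examples as (text, label) pairs:

```python
from collections import defaultdict

def find_label_conflicts(examples: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Return inputs that appear with more than one label -- exactly the
    inconsistency that fine-tuning will bake into the weights."""
    labels: dict[str, set[str]] = defaultdict(set)
    for text, label in examples:
        labels[text.strip().lower()].add(label)
    return {text: seen for text, seen in labels.items() if len(seen) > 1}
```

It won't catch subtle quality problems, but if this returns anything non-trivial on your dataset, fix the data before spending a penny on training.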

There's also the operational tail: maintaining a custom model, re-evaluating on base model updates, handling drift, building eval pipelines. This isn't a one-off cost.

In practice

Work through it in order. Try prompting properly — not for an afternoon, for a week. Build an eval set. If the model lacks knowledge, add RAG. Get chunking and embeddings right. Measure retrieval independently. Only if behaviour is still wrong with good prompts and good retrieval, consider fine-tuning.

The best production systems I've seen use all three together. But they got there by adding complexity in order, not starting with the most complex option.

One more thing: if someone tells you they need fine-tuning, ask if they've tried prompting properly first. They almost certainly haven't.