I spent six weeks at Edxcare building a fine-tuned model for a problem that turned out to be solvable in three days with better prompt engineering. The mistake was answering the wrong question. The question was not "can we fine-tune?" - it was "should we fine-tune?" This framework is designed to prevent that mistake.
The Fundamental Distinction
Prompt engineering changes how you ask the model. The weights stay the same; you are steering existing capabilities with better instructions.
Fine-tuning changes the model's weights. You are modifying the model itself - its style, its default behaviors, its specialized knowledge. It is a training-time intervention, not an inference-time one.
This distinction matters because the cost, complexity, and failure modes are completely different. Getting prompt engineering wrong costs you an afternoon. Getting fine-tuning wrong costs you weeks and real money.
Start with Prompt Engineering - Always
Before considering fine-tuning, exhaust prompt engineering. Most product teams underestimate what good prompt engineering can do.
Prompt engineering can solve: output format consistency, chain-of-thought reasoning, role specialization, constraint enforcement, few-shot learning, tone and style control. At HCLTech, we pushed a medical coding assistant from 62% to 81% accuracy purely through prompt iteration - chain-of-thought instructions, five worked examples, explicit format constraints, and a "reason first, code second" instruction pattern.
Run at least 10-20 prompt iterations before declaring the approach insufficient. Test on a representative evaluation set, not just ad-hoc examples, and track accuracy systematically.
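Systematic iteration can be as simple as scoring every prompt variant against the same fixed evaluation set. A minimal sketch in Python - `call_model` is a hypothetical stand-in for whatever LLM client you use, and both function names are mine:

```python
# A minimal iteration harness: score every prompt variant on the same
# fixed evaluation set so results stay comparable across iterations.
# `call_model` is a hypothetical stand-in for your LLM API client.

def evaluate_prompt(template, eval_set, call_model):
    """Accuracy of one prompt template on a fixed eval set."""
    correct = sum(
        1 for ex in eval_set
        if call_model(template.format(input=ex["input"])).strip() == ex["expected"]
    )
    return correct / len(eval_set)

def best_prompt(variants, eval_set, call_model):
    """Score each variant and return the winner plus the full scoreboard."""
    scores = {v: evaluate_prompt(v, eval_set, call_model) for v in variants}
    return max(scores, key=scores.get), scores
```

The scoreboard is the point: keeping every variant's score makes "we tried 15 prompts and plateaued at X%" an evidence-backed claim rather than a feeling.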
When Fine-Tuning Actually Makes Sense
Fine-tuning is genuinely useful in four situations:
- Style or format cannot be prompted consistently. If you have tried extensive few-shot examples and the model still produces inconsistent output format, a fine-tuned model will learn the pattern reliably. This is most common for highly structured outputs with unusual schemas.
- Latency and cost constraints are severe and the task is narrow and high-volume. A fine-tuned GPT-4o-mini on a specific classification task will be faster and cheaper than GPT-4o with a long few-shot prompt. At 10 million calls per month, this matters.
- You need to teach the model a domain-specific vocabulary or reasoning pattern it does not have. Rare for frontier models on most tasks, but real for highly specialized domains (proprietary clinical protocols, niche legal frameworks).
- You need to reduce prompt length for a high-volume repetitive task. Fine-tuning can bake the behavior in so you do not need to repeat instructions in every call.
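The latency-and-cost case is easy to sanity-check with arithmetic. A sketch with assumed numbers - a 1,500-token few-shot prompt replaced by a 100-token prompt on a fine-tuned model, at a hypothetical $0.60 per million input tokens:

```python
# Back-of-the-envelope check for the narrow, high-volume case.
# All prices and token counts here are assumptions for illustration;
# substitute your provider's current rates before deciding anything.

calls_per_month = 10_000_000
few_shot_prompt_tokens = 1_500    # long instructions plus worked examples
fine_tuned_prompt_tokens = 100    # behavior baked into the weights

tokens_saved = calls_per_month * (few_shot_prompt_tokens - fine_tuned_prompt_tokens)
price_per_million_input = 0.60    # hypothetical input price, $/1M tokens
monthly_savings = tokens_saved / 1_000_000 * price_per_million_input
print(f"{tokens_saved:,} input tokens saved per month -> ${monthly_savings:,.0f}")
```

At low volume the same arithmetic usually shows savings that never repay the fine-tuning effort, which is why the "narrow and high-volume" qualifier matters.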
When Fine-Tuning Is the Wrong Answer
Do not fine-tune when:
- Your knowledge base changes frequently. Fine-tuned weights are static. Use RAG for dynamic knowledge.
- You have fewer than 100-500 high-quality training examples. Below this threshold, a good prompt often outperforms fine-tuning.
- The task requires factual accuracy about the real world. Fine-tuning can increase confident hallucination - the model learns the style but not necessarily the facts.
- You have not tried RAG yet. The majority of "we need fine-tuning" problems are actually knowledge gap problems, not capability or style problems. RAG fixes knowledge gaps without touching weights.
The Decision Tree
- Is the problem a knowledge gap (model does not know your information)? Use RAG.
- Is the problem a format or style issue? Try prompt engineering first (10+ iterations with evaluation). If it still fails consistently, consider fine-tuning.
- Is the problem latency or cost with a narrow high-volume task? Fine-tune a smaller model.
- Is the problem reasoning capability? Try chain-of-thought or a stronger base model first.
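The tree above can be written as a straight-line function, which is a useful forcing device: each branch demands an answer backed by evaluation data before you move on. A sketch - the parameter names are mine, not a standard API:

```python
# The decision tree above as a straight-line function (parameter names
# are illustrative). Each boolean should be backed by evaluation data,
# not intuition.

def choose_approach(knowledge_gap, format_or_style, prompt_iterations,
                    prompt_works, latency_or_cost_bound, narrow_high_volume):
    if knowledge_gap:
        return "RAG"
    if format_or_style:
        # Fine-tuning is on the table only after 10+ measured iterations fail.
        if prompt_iterations < 10 or prompt_works:
            return "prompt engineering"
        return "consider fine-tuning"
    if latency_or_cost_bound and narrow_high_volume:
        return "fine-tune a smaller model"
    return "chain-of-thought or a stronger base model"
```

Note the ordering: RAG is checked first because knowledge gaps masquerade as every other kind of problem.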
Data Requirements for Fine-Tuning
You need high-quality input-output pairs. Quality matters far more than quantity. A common mistake is generating synthetic training data with an LLM - you end up training a model to mimic LLM outputs, which often amplifies the base model's failure modes rather than fixing them.
For OpenAI fine-tuning: the API minimum is 10 examples (too few to be useful); meaningful improvement starts around 100, with diminishing returns after 10,000 for most tasks. For domain-specific style or format, 200-500 curated examples often suffice. For capability improvement, you typically need thousands of examples, and even then a stronger base model is often a better use of resources.
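For reference, OpenAI's fine-tuning API takes training data as JSONL chat transcripts: one conversation per line, each ending with the assistant turn you want the model to imitate. A sketch - the medical-coding messages are made up for illustration:

```python
import json

# Sketch of the JSONL chat format OpenAI's fine-tuning API expects:
# one JSON object per line, each a complete conversation ending with
# the assistant turn the model should learn. Example content is
# illustrative, not real training data.

examples = [
    {"messages": [
        {"role": "system", "content": "You are a medical coding assistant."},
        {"role": "user", "content": "Patient presents with acute sinusitis."},
        {"role": "assistant", "content": "J01.90"},
    ]},
]

def to_jsonl(examples):
    """Serialize examples to JSONL, checking each ends with an assistant turn."""
    lines = []
    for ex in examples:
        # An example with no final assistant message gives the model
        # nothing to learn from.
        assert ex["messages"][-1]["role"] == "assistant"
        lines.append(json.dumps(ex))
    return "\n".join(lines)
```

Write the result to a `.jsonl` file and upload it through the Files API; validating the structure locally first saves a failed training job.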
Cost Reality Check
Prompt engineering: zero additional infrastructure cost beyond your existing API spend.
Fine-tuning with OpenAI: approximately $25 per million training tokens for GPT-4o-mini. A fine-tuning run on 10,000 examples averaging 500 tokens each costs roughly $125. Inference on fine-tuned models also costs more than on the base model.
Fine-tuning with self-hosted models: GPU costs. An A100 runs $2-4 per hour on cloud providers, and a typical fine-tuning run for a 7B-parameter model takes 2-8 hours. Add ongoing serving costs that scale with traffic.
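The $125 figure is straightforward to verify - worked through, using the rates stated above (check current pricing before budgeting):

```python
# Verifying the arithmetic in the text: 10,000 examples averaging
# 500 tokens at ~$25 per million training tokens (figures from the
# text; pricing changes, so confirm current rates).

n_examples = 10_000
avg_tokens_per_example = 500
price_per_million_training_tokens = 25.0
epochs = 1  # training is billed per trained token, so more epochs scale the cost

training_tokens = n_examples * avg_tokens_per_example * epochs
cost = training_tokens / 1_000_000 * price_per_million_training_tokens
print(f"{training_tokens:,} training tokens -> ${cost:.0f}")
# prints: 5,000,000 training tokens -> $125
```

The `epochs` line is the usual surprise: a default of 3-4 epochs triples or quadruples the naive estimate.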
The Takeaway
Fine-tuning is a legitimate tool with specific use cases. The mistake is reaching for it too early. Start with RAG for knowledge problems, prompt engineering for behavior problems, and fine-tuning only when you have evidence - real evaluation data - that cheaper approaches are insufficient. The teams shipping the best AI products I have seen use fine-tuning sparingly and prompt engineering obsessively.