I spent six weeks at Edxcare building a fine-tuned model for a problem that turned out to be solvable in three days with better prompt engineering. The mistake was answering the wrong question: not "can we fine-tune?" but "should we fine-tune?" This framework is designed to prevent that mistake.

The Fundamental Distinction

Prompt engineering changes how you ask the model. The weights stay the same; you are steering existing capabilities with better instructions.

Fine-tuning changes the model's weights. You are modifying the model itself - its style, its default behaviors, its specialized knowledge. It is a training-time intervention, not an inference-time one.

This distinction matters because the cost, complexity, and failure modes are completely different. Getting prompt engineering wrong costs you an afternoon. Getting fine-tuning wrong costs you weeks and real money.

Start with Prompt Engineering - Always

Before considering fine-tuning, exhaust prompt engineering. Most product teams underestimate what good prompt engineering can do.

Prompt engineering can solve: output format consistency, chain-of-thought reasoning, role specialization, constraint enforcement, few-shot learning, tone and style control. At HCLTech, we pushed a medical coding assistant from 62% accuracy to 81% accuracy purely through prompt iteration - chain-of-thought instructions, five worked examples, explicit format constraints, and a "reason first, code second" instruction pattern.
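The "reason first, code second" pattern can be sketched as a prompt builder. This is an illustrative reconstruction, not the actual HCLTech prompt; the task, example note, and code are hypothetical stand-ins.

```python
# Sketch of a chain-of-thought prompt with worked examples and an
# explicit format constraint ("reason first, code second").

def build_prompt(note: str, examples: list[tuple[str, str, str]]) -> str:
    """Assemble a prompt from (note, reasoning, code) worked examples."""
    parts = [
        "You are a medical coding assistant.",
        "For each clinical note, first explain your reasoning,",
        "then output exactly one final line: CODE: <code>.",
        "",
    ]
    for ex_note, reasoning, code in examples:
        parts += [f"Note: {ex_note}", f"Reasoning: {reasoning}", f"CODE: {code}", ""]
    parts += [f"Note: {note}", "Reasoning:"]
    return "\n".join(parts)

prompt = build_prompt(
    "Patient presents with acute sinusitis.",
    [("Routine checkup, no complaints.", "Preventive visit, no illness.", "Z00.00")],
)
```

Ending the prompt at "Reasoning:" nudges the model to produce its explanation before committing to a code, which is the whole point of the pattern.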

Run at least 10-20 prompt iterations before declaring it insufficient. Test on a representative evaluation set, not just ad-hoc examples. Track accuracy systematically.
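Tracking accuracy systematically can be as simple as a loop over a labeled eval set. A minimal harness sketch, where `call_model` is a hypothetical stand-in for your actual API call:

```python
# Score each prompt variant against the same fixed eval set so iterations
# are compared on accuracy, not on ad-hoc examples.

def evaluate(call_model, prompt_template: str, eval_set: list[dict]) -> float:
    """Return accuracy of one prompt variant over a labeled eval set."""
    correct = 0
    for case in eval_set:
        output = call_model(prompt_template.format(**case))
        correct += output.strip() == case["expected"]
    return correct / len(eval_set)

# Fake model for illustration: always answers "X".
eval_set = [{"input": "a", "expected": "X"}, {"input": "b", "expected": "Y"}]
fake_model = lambda prompt: "X"
acc = evaluate(fake_model, "{input}", eval_set)  # → 0.5
```

Log the accuracy of every variant; the 10-20 iterations only count as evidence if they were all scored against the same set.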

When Fine-Tuning Actually Makes Sense

Fine-tuning is genuinely useful in four situations:

  1. Style or format cannot be prompted consistently. If you have tried extensive few-shot examples and the model still produces inconsistent output format, a fine-tuned model will learn the pattern reliably. This is most common for highly structured outputs with unusual schemas.
  2. Latency and cost constraints are severe and the task is narrow and high-volume. A fine-tuned GPT-4o-mini on a specific classification task will be faster and cheaper than GPT-4o with a long few-shot prompt. At 10 million calls per month, this matters.
  3. You need to teach the model a domain-specific vocabulary or reasoning pattern it does not have. Rare for frontier models on most tasks, but real for highly specialized domains (proprietary clinical protocols, niche legal frameworks).
  4. You need to reduce prompt length for a high-volume repetitive task. Fine-tuning can bake the behavior in so you do not need to repeat instructions in every call.
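The economics behind points 2 and 4 are easy to sanity-check. A back-of-envelope sketch; the per-million-token prices and token counts below are illustrative assumptions, not current published rates:

```python
# Monthly inference cost for a narrow high-volume task: a large model with a
# long few-shot prompt vs a fine-tuned small model with instructions baked in.

def monthly_cost(calls, prompt_tokens, output_tokens, in_price, out_price):
    """Prices are dollars per million tokens."""
    return calls * (prompt_tokens * in_price + output_tokens * out_price) / 1e6

big = monthly_cost(10_000_000, 2_000, 20, 2.50, 10.00)   # long few-shot prompt
small = monthly_cost(10_000_000, 150, 20, 0.30, 1.20)    # short baked-in prompt
# big ≈ $52,000/month, small ≈ $690/month
```

At 10 million calls per month, shrinking a 2,000-token prompt to 150 tokens dominates the comparison; the per-token price difference compounds it.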

When Fine-Tuning is the Wrong Answer

Do not fine-tune when:

  • Your knowledge base changes frequently. Fine-tuned weights are static. Use RAG for dynamic knowledge.
  • You have fewer than 100-500 high-quality training examples. Below this threshold, a good prompt often outperforms fine-tuning.
  • The task requires factual accuracy about the real world. Fine-tuning can increase confident hallucination - the model learns the style but not necessarily the facts.
  • You have not tried RAG yet. The majority of "we need fine-tuning" problems are actually knowledge gap problems, not capability or style problems. RAG fixes knowledge gaps without touching weights.

The Decision Tree

  • Is the problem a knowledge gap (model does not know your information)? Use RAG.
  • Is the problem a format or style issue? Try prompt engineering first (10+ iterations with evaluation). If it still fails consistently, consider fine-tuning.
  • Is the problem latency or cost with a narrow high-volume task? Fine-tune a smaller model.
  • Is the problem reasoning capability? Try chain-of-thought or a stronger base model first.
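The decision tree above can be sketched as a lookup. The categories are coarse by design; real triage will often mix several of them:

```python
# The decision tree as a function: map the diagnosed problem to a first move.

def choose_approach(problem: str) -> str:
    return {
        "knowledge_gap": "RAG",
        "format_or_style": "prompt engineering first; fine-tune only if it still fails",
        "latency_or_cost": "fine-tune a smaller model",
        "reasoning": "chain-of-thought or a stronger base model",
    }.get(problem, "start with prompt engineering")
```

The default branch encodes the article's thesis: when in doubt, start with the cheapest intervention.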

Data Requirements for Fine-Tuning

You need high-quality input-output pairs. Quality matters far more than quantity. A common mistake is generating synthetic training data with an LLM - you end up training a model to mimic LLM outputs, which often amplifies the base model's failure modes rather than fixing them.

For OpenAI fine-tuning: the minimum is 10 examples (far too few to be useful); meaningful improvement starts around 100, with diminishing returns after 10,000 for most tasks. For domain-specific style or format, 200-500 curated examples often suffice. For capability improvement, you typically need thousands of examples - and even then, a stronger base model is often a better use of resources.
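OpenAI's fine-tuning API expects training data as JSONL, one chat transcript per line. A sketch of writing a training file from curated pairs (the example pair and system message are placeholders):

```python
# Write curated (input, output) pairs in OpenAI's chat fine-tuning format:
# each JSONL line is an object with a "messages" list.
import json

pairs = [
    ("Summarize: the meeting moved to Friday.", "Meeting rescheduled to Friday."),
]

with open("train.jsonl", "w") as f:
    for user_msg, assistant_msg in pairs:
        record = {
            "messages": [
                {"role": "system", "content": "You are a concise summarizer."},
                {"role": "user", "content": user_msg},
                {"role": "assistant", "content": assistant_msg},
            ]
        }
        f.write(json.dumps(record) + "\n")
```

Curation happens before this step: every assistant message in the file is a behavior you are teaching, including the mistakes.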

Cost Reality Check

Prompt engineering: zero additional infrastructure cost beyond your existing API spend.

Fine-tuning with OpenAI: approximately $25 per million training tokens for GPT-4o-mini. A fine-tuning run on 10,000 examples averaging 500 tokens each costs roughly $125. Inference on fine-tuned models costs more than the base model.
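The training figure above is a straightforward multiplication, using the quoted price and assuming a single pass over the data (multiple epochs multiply the cost accordingly):

```python
# Back-of-envelope training cost: examples × tokens per example × price.

def training_cost(n_examples: int, avg_tokens: int, price_per_m: float) -> float:
    """Cost in dollars for one pass; price_per_m is dollars per million tokens."""
    return n_examples * avg_tokens / 1_000_000 * price_per_m

cost = training_cost(10_000, 500, 25.0)  # → 125.0
```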

Fine-tuning with self-hosted models: GPU costs. An A100 runs $2-4 per hour on cloud providers, and a typical fine-tuning run for a 7B-parameter model takes 2-8 hours. Add ongoing serving costs, which scale with traffic.

The Takeaway

Fine-tuning is a legitimate tool with specific use cases. The mistake is reaching for it too early. Start with RAG for knowledge problems, prompt engineering for behavior problems, and fine-tuning only when you have evidence - real evaluation data - that cheaper approaches are insufficient. The teams shipping the best AI products I have seen use fine-tuning sparingly and prompt engineering obsessively.

