I've made the wrong call on this decision more than once - including at Edxcare, where we spent six weeks building a fine-tuned model for a problem that better prompt engineering solved in three days. Let me save you that mistake.
The Core Distinction
- Prompt engineering changes how you ask the model. Weights unchanged. You're steering existing capabilities.
- RAG changes what information the model has at inference time. Weights unchanged. You're adding external knowledge.
- Fine-tuning changes the model's weights. You're modifying the model itself - its style, specialized knowledge, default behaviors.
Prompt engineering and RAG are inference-time interventions. Fine-tuning is a training-time intervention. The cost, complexity, and failure modes are completely different.
Start Here: Can Prompting Solve It?
Before considering anything more complex, exhaust prompt engineering first. Prompt engineering can solve: tone and format consistency, output schemas (JSON), multi-step reasoning (chain-of-thought), role specialization, constraint enforcement, and few-shot learning.
At HCLTech, we were building a medical coding assistant. Initial accuracy was around 62%. Before touching the architecture, we iterated on the prompt: added chain-of-thought instructions, included 5 worked examples, specified ICD-10 code format explicitly, and instructed the model to explain its reasoning before giving the code. Accuracy jumped to 81%. We then moved to RAG - no fine-tuning needed.
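The structure we converged on can be sketched roughly like this. The wording and few-shot examples here are illustrative inventions, not our production prompt:

```python
# Illustrative sketch of the prompt structure described above --
# the exact wording and examples are hypothetical, not the production prompt.
SYSTEM_PROMPT = """You are a medical coding assistant. For each clinical note:
1. Reason step by step about the documented diagnoses before committing to a code.
2. Output the final answer as an ICD-10-CM code (e.g. E11.9).
3. Explain your reasoning BEFORE the code, then end with 'CODE: <code>'."""

FEW_SHOT = [
    {"role": "user", "content": "Note: Patient with type 2 diabetes, no complications."},
    {"role": "assistant", "content": "The note documents type 2 diabetes mellitus "
     "without complications, which maps to E11.9.\nCODE: E11.9"},
    # ...in practice we included 5 worked examples covering edge cases
]

def build_messages(note: str) -> list[dict]:
    """Assemble system prompt + few-shot examples + the new note."""
    return [{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT,
            {"role": "user", "content": f"Note: {note}"}]
```

The point isn't the specific wording: it's that reasoning instructions, worked examples, and an explicit output format were each individually cheap to add and each moved accuracy.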
The RAG Decision
Use RAG when the problem is knowledge, not capability:
- The model doesn't know your proprietary information
- The model's training data is stale
- You need citations or source attribution
- Your knowledge base changes frequently
- Relevant context is too large to fit in a prompt but can be retrieved selectively
```python
from openai import OpenAI
from pinecone import Pinecone

def rag_query(user_question: str, k: int = 5) -> str:
    client = OpenAI()

    # Embed the question
    q_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=user_question,
    ).data[0].embedding

    # Retrieve the k most similar chunks
    pc = Pinecone(api_key="...")
    index = pc.Index("your-index")
    results = index.query(vector=q_embedding, top_k=k, include_metadata=True)
    context = "\n\n".join(r["metadata"]["text"] for r in results["matches"])

    # Generate an answer grounded in the retrieved context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based on provided context. If not in context, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_question}"},
        ],
    )
    return response.choices[0].message.content
```
RAG failure modes: retrieval misses (fix: hybrid search, reranking), context overload (fix: reduce k, use MMR), chunk boundary problems (fix: overlap, parent-child retrieval).
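One of those fixes, maximal marginal relevance (MMR), is worth seeing concretely. This is a simplified cosine-similarity sketch over raw embedding vectors, not a drop-in for any particular vector DB's built-in MMR:

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=5, lambda_=0.7):
    """Greedily pick k docs, balancing relevance to the query (weight lambda_)
    against redundancy with already-selected docs (weight 1 - lambda_)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda i: lambda_ * cos(query_vec, doc_vecs[i])
            - (1 - lambda_) * max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                                  default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return selected
```

Lower `lambda_` trades relevance for diversity: with two near-duplicate top hits, MMR will skip the second and pull in a different chunk instead, which is exactly what you want when context overload comes from redundant retrievals.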
The Fine-Tuning Decision
Fine-tuning is appropriate when:
- Style/format can't be prompted consistently even with extensive examples
- Latency and cost constraints are severe and the task is narrow and high-volume
- Capability gaps exist that examples don't fix (rare for frontier models)
When fine-tuning is NOT the answer:
- Your knowledge base changes frequently
- You haven't tried RAG + better prompts yet
- You have fewer than a few hundred high-quality training examples
- The task requires factual accuracy about the real world (fine-tuning can increase confident hallucination)
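If you do cross that threshold and fine-tune via OpenAI, the training data is chat-format JSONL. A sketch of assembling it (the field names follow OpenAI's fine-tuning format; the example content is invented):

```python
import json

# Invented example; real datasets need a few hundred high-quality
# examples, per the threshold above.
examples = [
    {"messages": [
        {"role": "system", "content": "Respond in the company's support voice."},
        {"role": "user", "content": "Where is my order?"},
        {"role": "assistant", "content": "Happy to help! Could you share your order number?"},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")  # one JSON object per line
```

Note that each example is a complete conversation demonstrating the target behavior - style and format transfer well from this kind of data; fresh factual knowledge does not.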
Decision Tree
- Knowledge gap? Use RAG.
- Format/style issue? Try prompt engineering first (10+ iterations), then consider fine-tuning.
- Severe cost/latency constraints + narrow high-volume task? Fine-tune a smaller model.
- Reasoning capability gap? Try chain-of-thought or a stronger base model first.
- Everything at once? RAG + prompt engineering first, then measure.
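The tree above collapses into a small routing function. This is a deliberately crude sketch (the function name and flags are mine); real decisions weigh several branches at once:

```python
def choose_approach(knowledge_gap: bool, style_issue: bool,
                    cost_latency_critical: bool, reasoning_gap: bool) -> str:
    """Crude encoding of the decision tree above, checked in priority order."""
    if knowledge_gap:
        return "RAG"
    if style_issue:
        return "prompt engineering first, then consider fine-tuning"
    if cost_latency_critical:
        return "fine-tune a smaller model"
    if reasoning_gap:
        return "chain-of-thought or a stronger base model"
    return "RAG + prompt engineering, then measure"
```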
Cost Reality Check
- Prompt engineering: zero additional infrastructure cost.
- RAG: embedding costs (~$0.02/M tokens), vector DB hosting ($70-500/month), 50-200ms of added latency.
- Fine-tuning: ~$25/M training tokens for GPT-4o-mini; self-hosted GPU serving runs $300-1,500/month.
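A rough back-of-envelope for the RAG column, using the figures above (the monthly volume is an invented placeholder; plug in your own):

```python
EMBED_COST_PER_M = 0.02    # $/M tokens, text-embedding-3-small
VECTOR_DB_MONTHLY = 70.0   # low end of the $70-500 hosting range

def rag_monthly_cost(tokens_embedded_millions: float) -> float:
    """Embedding spend plus flat vector DB hosting, in dollars/month."""
    return tokens_embedded_millions * EMBED_COST_PER_M + VECTOR_DB_MONTHLY

# e.g. embedding 100M tokens/month: 100 * 0.02 + 70 = $72/month
```

The takeaway: at most volumes, RAG's cost is dominated by the flat hosting fee, not the embeddings - which is why the real comparison against fine-tuning hinges on inference token counts, not retrieval.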
The Combination That Works Best
In production, the strongest AI features combine all three: a well-engineered system prompt + RAG for proprietary/dynamic knowledge + fine-tuning only for residual gaps with sufficient data and volume to justify it. At Edxcare, our learning assistant covered 95% of use cases with prompt engineering + RAG alone.