One of the most common questions I get from product managers entering AI for the first time is some version of: "We want to use AI for X. Should we use RAG or fine-tune a model?" Usually the answer is neither, or both, or something else entirely - which isn't satisfying but is honest.
The challenge is that RAG, fine-tuning, and agentic architectures aren't mutually exclusive alternatives. They solve different problems, and the best production AI systems often combine all three. But when you're making initial architecture decisions under resource and time constraints, you need a framework for choosing where to invest first.
This framework is the result of decisions I've made and watched others make across healthcare AI, enterprise software, and adjacent industries. The numbers I'll quote are rough approximations based on available benchmarks and industry data - your specific use case will vary, but the directional tradeoffs hold.
The Core Question: What Problem Are You Actually Solving?
Before choosing an architecture, identify which of these problems you're trying to solve:
- Knowledge access problem: The model doesn't have access to your proprietary data or recent information
- Behavior problem: The model doesn't behave the way you need it to (tone, format, reasoning style, domain-specific conventions)
- Task execution problem: The model needs to take actions, not just generate text
RAG primarily solves the knowledge access problem. Fine-tuning primarily solves the behavior problem. Agents primarily solve the task execution problem.
If you're solving more than one problem, you likely need more than one architecture - but start with the dominant constraint.
RAG: When Your Problem Is Knowledge Access
What it is
Retrieval-Augmented Generation (RAG) gives the model access to a corpus of documents at inference time. When a user asks a question, the system retrieves the most relevant document chunks, adds them to the context window, and lets the model synthesize an answer grounded in those documents.
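The retrieve-then-generate loop can be sketched in a few lines. This is a toy: keyword overlap stands in for embedding similarity and vector search, and the corpus, scoring, and prompt template are all illustrative assumptions, not a production pipeline.

```python
# Toy RAG sketch: keyword-overlap retrieval standing in for a real
# embedding + vector-search step, followed by prompt construction.

def score(query: str, chunk: str) -> int:
    """Count query words appearing in the chunk (stand-in for cosine similarity)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring chunks."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Prepend retrieved chunks to the user question for the LLM call."""
    context = "\n---\n".join(chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

corpus = [
    "Refunds are issued within 14 days of a returned item being received.",
    "Our headquarters opened in 2009 and employs 400 people.",
    "Returned items must include the original packaging and receipt.",
]
query = "What is the refund policy for returned items?"
prompt = build_prompt(query, retrieve(query, corpus))
```

In a real system, `retrieve` would hit a vector store and `build_prompt`'s output would go to the LLM; the grounding pattern is the same.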
When to use it
- Your knowledge base changes frequently (documentation, policy updates, product catalogs)
- You need citations - users want to know where the information came from
- Your data is too sensitive to send to a third party for fine-tuning
- You need to add new information without retraining
- Your use case requires up-to-date information that a static model won't have
When not to use it
- The information users need is in the base model's training data anyway
- You need the model to reason in a specific style, not just access specific facts
- Latency is critical and your retrieval pipeline adds unacceptable delay
- Your knowledge base is poorly structured or lacks good retrieval signal
Cost and latency
A well-optimized RAG pipeline adds approximately 100-300ms to inference time (retrieval + context construction). Cost overhead is primarily compute for embedding and vector search - roughly $0.0001-0.001 per query at current pricing depending on corpus size and embedding model choice. The dominant cost remains the LLM inference itself.
Real-world use cases
- Healthcare: Clinical protocol lookup, drug interaction checking against a hospital's formulary, billing code assistance
- Legal: Contract analysis against a firm's precedent library, regulatory compliance checking against current regulations
- Retail: Product catalog question-answering, return policy lookup, inventory status
- Fintech: Policy document Q&A for underwriting, regulatory guidance lookup, product terms explanation
Fine-Tuning: When Your Problem Is Model Behavior
What it is
Fine-tuning continues training a pre-trained model on your domain-specific examples, adjusting the model's weights to produce outputs that match your desired format, style, and reasoning patterns.
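Concretely, supervised fine-tuning data is usually a set of example conversations ending in the output you want, serialized one-per-line as JSONL in the chat format hosted fine-tuning APIs (such as OpenAI's) accept. The medical-coding example below is invented for illustration:

```python
# Sketch of supervised fine-tuning data: each training example is a short
# conversation whose final assistant message is the desired output, written
# as one JSON object per line (JSONL). Example content is invented.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a medical coding assistant."},
            {"role": "user", "content": "Code: type 2 diabetes, no complications."},
            {"role": "assistant", "content": "E11.9"},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a medical coding assistant."},
            {"role": "user", "content": "Code: essential hypertension."},
            {"role": "assistant", "content": "I10"},
        ]
    },
]

jsonl = "\n".join(json.dumps(e) for e in examples)
```

A training run over a few thousand lines like these is what nudges the weights toward your format and conventions.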
When to use it
- You need highly specific output formats that system prompts alone don't reliably produce
- You need the model to follow domain-specific conventions (medical coding formats, legal citation style, financial reporting standards)
- You're doing high-volume inference and need a smaller, cheaper model to match a larger model's domain performance
- You have 1,000+ high-quality labeled examples and need to encode complex expertise into the model
When not to use it
- You have fewer than a few hundred training examples - prompt engineering or RAG will likely outperform
- You need to update the model's knowledge frequently - retraining is expensive and slow
- You haven't first exhausted prompt engineering - many "fine-tuning problems" are actually prompt problems
- Your team lacks ML expertise to manage training runs, evaluation, and deployment
Cost and latency
Fine-tuning costs have dropped dramatically. OpenAI's fine-tuning API starts around $8/million tokens for GPT-4o-mini fine-tuning. A typical healthcare domain fine-tuning job might run $500-2,000. The payoff is inference cost reduction - a fine-tuned smaller model can match a larger base model at 30-70% lower inference cost at scale.
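The "smaller fine-tuned model" economics reduce to simple break-even arithmetic. The numbers below are placeholders consistent with the ranges above, not quotes from any provider:

```python
# Break-even for a fine-tune under assumed numbers: the one-off training
# cost is recovered once accumulated per-query inference savings exceed it.
# All prices are illustrative placeholders.

training_cost = 2000.00          # one-off fine-tuning job (assumed)
base_cost_per_query = 0.010      # larger base model (assumed)
tuned_cost_per_query = 0.004     # smaller fine-tuned model, 60% cheaper (assumed)

savings_per_query = base_cost_per_query - tuned_cost_per_query
break_even_queries = training_cost / savings_per_query
print(f"Break-even at ~{break_even_queries:,.0f} queries")
# → Break-even at ~333,333 queries
```

At a few hundred thousand queries the fine-tune pays for itself; well below that volume, prompt engineering on the base model is usually the better investment.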
Real-world use cases
- Healthcare: Clinical note completion that matches a health system's documentation style, ICD-10 coding from clinical text, radiology report standardization
- Legal: Contract clause generation in a firm's specific drafting style, jurisdiction-specific legal analysis format
- Fintech: Financial analysis in a bank's house style, regulatory filing generation in required formats
- Retail: Product description generation matching a brand's voice, customer service responses in brand tone
The fine-tuning trap: Teams often reach for fine-tuning when they've tried a few prompts and aren't happy with the results. Before fine-tuning, invest 1-2 days in serious prompt engineering - structured prompts with clear instructions, few-shot examples, chain-of-thought reasoning, output format specifications. You'll often get 80% of the way to fine-tuning quality at zero cost.
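A "serious prompt" along those lines combines explicit instructions, at least one few-shot example, and a pinned output format. The classification task and field names here are invented for illustration:

```python
# Structured prompt with the ingredients above: clear rules, a few-shot
# example, and an explicit output schema. Task and fields are invented.

FEW_SHOT = '''Input: "Package arrived crushed, want my money back."
Output: {"intent": "refund", "sentiment": "negative"}'''

def build_ticket_prompt(ticket: str) -> str:
    return f'''You classify customer support tickets.

Rules:
- Choose intent from: refund, exchange, question.
- Choose sentiment from: positive, neutral, negative.
- Respond with a single JSON object and nothing else.

Example:
{FEW_SHOT}

Input: "{ticket}"
Output:'''

prompt = build_ticket_prompt("Can I swap this for a larger size?")
```

If a prompt like this still fails reliably on held-out examples, that's the signal fine-tuning may actually be warranted.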
Agentic Architecture: When Your Problem Is Task Execution
What it is
Agents give the model tools and let it take multi-step actions to complete a goal. Rather than generating a single response, the agent reasons about what actions to take, calls tools, observes results, and continues until the goal is complete.
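The reason-act-observe loop looks like this in miniature. A stubbed policy function stands in for the LLM's tool-choice step; real agents replace `decide` with a model call that returns a tool name and arguments (function calling), but the loop shape is the same. Tools and IDs are invented:

```python
# Toy agent loop: decide on an action, call the tool, observe the result,
# repeat until the policy signals the goal is complete. The `decide` stub
# stands in for an LLM function-calling step; tools and IDs are invented.

TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "delivered"},
    "issue_refund": lambda order_id: {"order_id": order_id, "refunded": True},
}

def decide(goal, history):
    """Stub policy: look up the order, then refund it, then finish."""
    if not history:
        return ("lookup_order", "A123")
    if history[-1][0] == "lookup_order":
        return ("issue_refund", "A123")
    return None  # goal complete

def run_agent(goal, max_steps=5):
    """Run the reason-act-observe loop, capping total steps."""
    history = []
    for _ in range(max_steps):
        action = decide(goal, history)
        if action is None:
            break
        tool, arg = action
        history.append((tool, TOOLS[tool](arg)))
    return history

steps = run_agent("refund order A123")
```

The `max_steps` cap matters in practice: it bounds both cost and the blast radius of a confused agent.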
When to use it
- The task requires multiple steps with decision points that depend on intermediate results
- The task requires external data that can't be pre-loaded into context
- The task requires actions (sending emails, updating databases, executing code)
- The task varies enough in structure that a fixed workflow can't be pre-defined
When not to use it
- The task is simple and single-turn - agents add complexity and latency without benefit
- You need guaranteed deterministic behavior - agents introduce variability
- Your tool reliability is poor - agent failures compound tool failures
- The error cost of a wrong action is very high - agents need strong human-in-the-loop before acting in high-stakes contexts
Cost and latency
Agentic workflows multiply inference costs by the number of reasoning steps. A 5-step agent workflow using Claude 3.5 Sonnet might cost $0.05-0.20 per task vs $0.001-0.01 for a single-turn query. Latency is typically 5-30 seconds for multi-step workflows. These costs are falling rapidly as smaller models improve at tool use and reasoning.
Real-world use cases
- Healthcare: Prior authorization automation (check eligibility, retrieve clinical data, complete forms, submit, track), care coordination workflows
- Legal: Due diligence automation (identify key clauses, cross-reference regulations, flag risks, generate summary report)
- Fintech: Loan origination workflow (collect documents, verify identity, pull credit, run models, generate offer)
- Retail: Returns processing (verify purchase, assess reason, process refund or replacement, update inventory)
The Decision Framework
Here's the sequence I use when advising teams on architecture choice:
1. Start with base model + prompt engineering. Before anything else, spend a week on prompt engineering with a frontier model. You'll discover what the model can do well and where it falls short. This is your baseline.
2. If the failure is knowledge access (the model doesn't know your content): Add RAG.
3. If the failure is behavior/format (the model knows the content but produces wrong-shaped output): Consider fine-tuning, but only after exhausting prompt engineering with structured prompts and examples.
4. If the failure is task completion (single-turn generation isn't enough): Build agents, starting with simple tool use and expanding.
5. For complex systems: Combine. RAG + fine-tuning + agents is the architecture of mature production systems in regulated industries.
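The sequence above compresses into a first-pass routing function. The failure-mode labels are my shorthand for the framework's categories, not standard terminology:

```python
# First-pass router for the decision sequence: always start from a prompt
# engineering baseline, then map each observed failure mode to the
# architecture that addresses it. Labels are invented shorthand.

def choose_architecture(failure_modes: set[str]) -> list[str]:
    plan = ["prompt engineering baseline"]
    if "knowledge_access" in failure_modes:
        plan.append("RAG")
    if "behavior_format" in failure_modes:
        plan.append("fine-tuning (after exhausting prompts)")
    if "task_completion" in failure_modes:
        plan.append("agents (simple tool use first)")
    return plan

print(choose_architecture({"knowledge_access", "task_completion"}))
```

Note the baseline is unconditional: every path starts with prompt engineering, and multiple failure modes yield the combined architecture.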
Industry-Specific Notes
Healthcare
RAG is almost universally the right starting point - clinical knowledge is constantly updating, citation requirements are strict, and patient data cannot be sent to external fine-tuning endpoints without significant privacy engineering. Agentic architecture is the next investment for workflow automation. Fine-tuning comes third for specific structured output tasks like coding and documentation.
Fintech
Fine-tuning has strong ROI for format-critical applications (regulatory filings, financial reports). RAG is valuable for policy and regulation Q&A. Agents are advancing quickly in back-office automation where workflow complexity and action execution are the dominant requirements.
Retail
RAG for product catalog and policy Q&A. Fine-tuning for brand voice consistency at scale. Agents for customer service automation where multi-turn problem resolution requires external actions (order lookup, refund processing, exchange initiation).
Legal
RAG for precedent and statute retrieval - the knowledge access problem is dominant and the knowledge is constantly updated. Fine-tuning for specific drafting styles. Agents for document review workflows. Human-in-the-loop is non-negotiable at every consequential decision point.
The architecture decision isn't permanent. Most teams should start with RAG because it's the fastest path to value with the lowest infrastructure investment. As you learn where the model falls short, add fine-tuning or agents selectively. The teams that try to architect the perfect system before building anything are routinely beaten to market by teams that shipped RAG on day 30 and iterated from there.