Most RAG implementations start the same way: chunk your documents, embed them, stuff the top-K results into the prompt. That works in a demo. In production, it breaks in predictable ways - poor retrieval, irrelevant chunks, contradictory context. I have shipped RAG systems at HCLTech for clinical trial analysis and at Edxcare for curriculum Q&A. The pattern that fails most reliably is naive RAG applied to a problem it was never designed for.

This post maps the major RAG patterns to the problems they actually solve.

Pattern 1: Naive RAG

The baseline. Embed your corpus, embed the query, retrieve top-K by cosine similarity, append to prompt.

When it works: Simple factual Q&A over a homogeneous corpus. FAQ bots. Single-topic knowledge bases under a few thousand documents. Low-stakes applications where occasional retrieval misses are acceptable.

Where it breaks: Multi-hop questions requiring reasoning across documents. Heterogeneous corpus with mixed document types. High-precision applications where wrong answers have real costs. Queries that need recent information mixed with historical context.

The failure mode is almost always retrieval quality, not generation quality. If you feed the LLM good context, it generates good answers. The problem is getting the right context.
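The whole baseline pipeline fits in a few lines. A minimal sketch, with a hashed bag-of-words `toy_embed` standing in for a real embedding model (both function names are illustrative; in production you would use a SentenceTransformer model and a vector store):

```python
import zlib
import numpy as np

def toy_embed(text, dim=256):
    # Hashed bag-of-words stand-in for a real embedding model.
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[zlib.crc32(tok.encode()) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def naive_rag_retrieve(query, documents, k=2):
    # Embed corpus and query, rank by cosine similarity, return top-K.
    doc_embs = np.array([toy_embed(d) for d in documents])
    scores = doc_embs @ toy_embed(query)  # cosine: vectors are unit-norm
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

docs = [
    "Aspirin reduces fever and mild pain.",
    "The IND application covers safety reporting timelines.",
    "Paris is the capital of France.",
]
print(naive_rag_retrieve("safety reporting for an IND application", docs, k=1))
```

The top-K chunks would then be appended to the prompt; everything past this point in the post is about making that retrieval step better.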

Pattern 2: Advanced Retrieval - Hybrid Search and Reranking

The first upgrade worth making. Hybrid search combines dense retrieval (embeddings) with sparse retrieval (BM25 keyword matching) using Reciprocal Rank Fusion or a learned combiner.

Why this matters: embeddings capture semantic similarity but can miss exact keyword matches that humans intuitively expect. A query for "IND application safety reporting" in a pharmaceutical corpus should match documents containing those exact terms even if the embedding similarity is moderate.

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

def hybrid_search(query, documents, k=10, alpha=0.5):
    # Dense retrieval (in production, load the model once outside this function)
    model = SentenceTransformer("BAAI/bge-large-en-v1.5")
    q_emb = model.encode(query, normalize_embeddings=True)
    doc_embs = model.encode(documents, normalize_embeddings=True)
    dense_scores = doc_embs @ q_emb  # cosine similarity on unit vectors

    # Sparse retrieval (BM25 over whitespace tokens; use a real tokenizer in practice)
    tokenized = [d.split() for d in documents]
    bm25 = BM25Okapi(tokenized)
    sparse_scores = np.array(bm25.get_scores(query.split()))

    # Min-max normalize each score list, then combine with a weighted sum.
    # This is simple score fusion; Reciprocal Rank Fusion is a rank-based alternative.
    dense_norm = (dense_scores - dense_scores.min()) / (dense_scores.max() - dense_scores.min() + 1e-8)
    sparse_norm = (sparse_scores - sparse_scores.min()) / (sparse_scores.max() - sparse_scores.min() + 1e-8)
    combined = alpha * dense_norm + (1 - alpha) * sparse_norm

    top_k_idx = np.argsort(combined)[::-1][:k]
    return [documents[i] for i in top_k_idx]
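The function above fuses min-max-normalized scores with a weighted sum. Reciprocal Rank Fusion, the other combiner mentioned earlier, works on ranks rather than raw scores, so it needs no normalization and is robust to the two retrievers producing scores on different scales. A minimal sketch (`reciprocal_rank_fusion` is an illustrative name; k=60 is the conventional constant from the original RRF paper):

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: list of ranked lists of doc ids, best first.
    # RRF score for a doc: sum over rankings of 1 / (k + rank), rank from 1.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["d2", "d1", "d3"]
sparse_ranking = ["d2", "d4", "d1"]
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
# → ['d2', 'd1', 'd4', 'd3']
```

Documents that appear high in either ranking float to the top; documents that appear in only one ranking are penalized but not discarded.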

After retrieval, add a cross-encoder reranker (Cohere Rerank, BGE Reranker). The retrieval step casts a wide net; the reranker selects the best candidates. On our clinical trial corpus, adding reranking improved answer relevance by about 30% over dense-only retrieval.
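A cross-encoder scores each (query, document) pair jointly rather than comparing precomputed vectors, which is why it is both more accurate and too slow to run over the whole corpus. A sketch of the two-stage flow, with a toy token-overlap `score_fn` standing in for a real cross-encoder such as BGE Reranker (all names here are illustrative):

```python
def rerank(query, candidates, score_fn, top_n=3):
    # Stage 2 of the pipeline: `candidates` come from the wide-net retriever;
    # `score_fn(query, doc)` stands in for a cross-encoder that scores each
    # (query, doc) pair jointly.
    return sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)[:top_n]

def token_overlap(query, doc):
    # Toy relevance score (assumption): fraction of query tokens found in the doc.
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / max(len(q), 1)

candidates = [
    "A timeline of the company history",
    "Reporting timeline for adverse event submissions",
]
print(rerank("adverse event reporting timeline", candidates, token_overlap, top_n=1))
```

With a real cross-encoder you would batch the pairs through the model; the retrieve-wide-then-rerank-narrow structure stays the same.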

Pattern 3: Parent-Child Retrieval (Multi-Granularity)

The chunking problem: small chunks retrieve precisely but lose context. Large chunks provide context but introduce noise and hit token limits.

The solution: index small chunks for retrieval, but return the parent chunk (or full document section) for generation. You retrieve based on a 128-token sentence, then feed the 1,024-token paragraph it came from to the LLM.

LangChain implements this as the ParentDocumentRetriever; LlamaIndex offers the same idea as "small-to-big" retrieval (its auto-merging and sentence-window retrievers). Use this pattern whenever your documents have natural hierarchical structure - sections within papers, articles within reports, FAQs with long answers.
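The mechanic is easy to sketch. Assuming a naive word-window splitter and token-overlap scoring in place of real chunking and embeddings (all function names illustrative): children are indexed for matching, but the parent text is what gets returned.

```python
def split_children(parent_text, max_words=8):
    # Naive splitter (assumption): fixed-size word windows as child chunks.
    words = parent_text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def parent_child_retrieve(query, parents, k=1):
    # Index small child chunks, keeping a pointer back to each parent.
    children = [(chunk, pi) for pi, p in enumerate(parents)
                for chunk in split_children(p)]
    q_tokens = set(query.lower().split())
    ranked = sorted(children,
                    key=lambda c: len(q_tokens & set(c[0].lower().split())),
                    reverse=True)
    results, seen = [], set()
    for _chunk, pi in ranked:
        if pi not in seen:
            seen.add(pi)
            results.append(parents[pi])  # the full parent goes to the LLM
        if len(results) == k:
            break
    return results
```

The retrieval match happens against a small, precise chunk, but the LLM sees the surrounding context it needs to answer well.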

Pattern 4: Agentic RAG

Use this when a single retrieval step is insufficient for multi-hop questions. The agent decides iteratively: retrieve, read, determine what is still unknown, retrieve again, synthesize.

The ReAct pattern (Reason + Act) is the standard implementation: the LLM reasons about what it needs, calls a retrieval tool, gets results, reasons again, and repeats until it can answer confidently.
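The loop itself is simple; the intelligence lives in the model's replies. A sketch assuming `retrieve` and `llm` are caller-supplied functions, and using "SEARCH:" / "ANSWER:" reply prefixes as a simplified stand-in for real tool-calling (all names and the protocol are illustrative):

```python
def react_rag(question, retrieve, llm, max_steps=4):
    # ReAct-style loop: reason about what is missing, act by retrieving,
    # repeat until the model commits to an answer or the budget runs out.
    context, query = [], question
    for _ in range(max_steps):
        context.extend(retrieve(query))
        prompt = (
            f"Question: {question}\n"
            "Context so far:\n" + "\n".join(context) + "\n"
            "Reply 'SEARCH: <new query>' if more information is needed, "
            "otherwise 'ANSWER: <final answer>'."
        )
        reply = llm(prompt)
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        if reply.startswith("SEARCH:"):
            query = reply[len("SEARCH:"):].strip()
        else:
            return reply  # model ignored the protocol; surface its text
    return "Could not answer within the step budget."
```

Note that every loop iteration is one LLM call plus one retrieval call, which is where the 5-10x cost multiplier below comes from.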

When to use agentic RAG:

  • Questions requiring information from multiple documents combined non-trivially
  • Research tasks where the query itself evolves as you learn more
  • Situations where the LLM needs to validate or cross-check retrieved information

Cost warning: Agentic RAG can make 5-10 LLM calls per user question. Budget and latency implications are significant. Only use when the quality improvement justifies the cost.

Pattern 5: Graph RAG

Microsoft Research released GraphRAG in 2024. The core insight: traditional vector search finds locally similar chunks but misses global patterns and relationships across the corpus.

GraphRAG builds a knowledge graph from your documents (entities, relationships, communities), then uses graph traversal alongside vector search. It excels at questions like "what are the major themes across this corpus" or "how does concept A relate to concept B across multiple documents."

The tradeoff: indexing is expensive (requires LLM calls to extract entities and build the graph). Not justified for simple Q&A. Worth it for large, complex corpora where relational reasoning matters.
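The retrieval side can be sketched with a plain adjacency map. Once entities and relations are extracted (the expensive, LLM-driven step), answering "how does A relate to B" becomes graph traversal rather than similarity search. All names below are illustrative:

```python
from collections import defaultdict

def build_graph(triples):
    # triples: (entity, relation, entity) tuples; in GraphRAG this
    # extraction is done by LLM calls over the corpus.
    graph = defaultdict(list)
    for head, rel, tail in triples:
        graph[head].append((rel, tail))
        graph[tail].append((rel, head))
    return graph

def related_entities(graph, start, hops=2):
    # Multi-hop breadth-first traversal: surfaces connections that pure
    # vector search over isolated chunks tends to miss.
    frontier, seen = {start}, {start}
    for _ in range(hops):
        frontier = {tail for e in frontier for _rel, tail in graph[e]} - seen
        seen |= frontier
    return seen - {start}

triples = [
    ("Drug X", "targets", "Protein Y"),
    ("Protein Y", "implicated_in", "Disease Z"),
    ("Drug W", "targets", "Protein V"),
]
print(sorted(related_entities(build_graph(triples), "Drug X")))
# → ['Disease Z', 'Protein Y']
```

Note that "Drug X" reaches "Disease Z" in two hops even though no single document links them directly - that is the class of question Graph RAG exists for.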

Decision Framework

  • Simple Q&A, homogeneous corpus: Naive RAG with hybrid search
  • Long documents with natural structure: Add parent-child retrieval
  • High precision requirements: Add reranking
  • Multi-hop questions: Agentic RAG
  • Complex relational queries across large corpus: Graph RAG

Where this lands

Start with the simplest pattern that could work. Measure retrieval quality before optimizing generation quality - if your retrieved chunks are wrong, a better LLM will not save you. Add complexity only when measurements show where the system is failing. The teams I have seen waste the most time on RAG are the ones who jumped to agentic patterns before they had good retrieval working.

