Bad embedding choices do not surface in a prototype with 100 documents. They surface in production with 100,000 documents, elevated retrieval latency, degraded search quality, and a surprise bill from your embedding API. I have rebuilt embedding pipelines in production more than once. Here is how to get it right from the start.
Choosing an Embedding Model
The MTEB (Massive Text Embedding Benchmark) leaderboard is the most useful reference for comparing embedding models. But benchmark performance and production performance can diverge. A few principles that hold consistently:
Match the model to your domain. General-purpose embedding models (text-embedding-3-large, BGE-large) perform well on general text. For specialized domains - clinical notes, legal contracts, code - domain-specific models or models fine-tuned on domain data outperform general models, sometimes substantially.
Larger is not always better. text-embedding-3-large (3072 dimensions) produces better embeddings than text-embedding-3-small (1536 dimensions) - at several times the per-token API cost and twice the storage footprint. For many production applications, the small model is sufficient. Benchmark on your data before defaulting to the large model.
Consider open-source models for cost-sensitive applications. BGE-large-en-v1.5 (open source, BAAI) and E5-large (Microsoft) are competitive with OpenAI's models on most English text benchmarks and free to self-host. At 10 million embeddings, the cost difference is material.
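To make "material" concrete, it helps to do the arithmetic up front. The helper below is a back-of-envelope sketch; the token count per chunk and the per-million-token prices are illustrative assumptions, not current published rates - check your provider's pricing page before relying on the numbers.

```python
def embedding_cost_usd(num_chunks: int, tokens_per_chunk: int,
                       price_per_million_tokens: float) -> float:
    """Back-of-envelope API cost for embedding a corpus."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens

# 10 million chunks at an assumed ~300 tokens each:
small = embedding_cost_usd(10_000_000, 300, 0.02)  # assumed $/1M tokens, small model
large = embedding_cost_usd(10_000_000, 300, 0.13)  # assumed $/1M tokens, large model
print(f"small: ${small:,.0f}, large: ${large:,.0f}")
```

Under these assumptions the gap is hundreds of dollars per full embed - and it recurs every time you re-embed, which is where self-hosting an open-source model starts to pay for its operational overhead.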
Chunking Strategy is More Important Than Most Teams Realize
How you split documents before embedding has more impact on retrieval quality than which embedding model you use. Poor chunking is the most common root cause of poor RAG performance.
Fixed-size chunking: Simple. Split every N tokens with an overlap of M tokens. Fast to implement. Works poorly when sentences or logical units span chunk boundaries.
Semantic chunking: Split at semantic boundaries (sentence breaks, paragraph breaks, section headers). Preserves logical units. Requires more preprocessing but produces better retrieval.
Recursive character text splitting (LangChain default): Tries to split on paragraph breaks first, then sentence breaks, then word breaks. Good default for unstructured text.
Late chunking (2024): A newer approach where you embed the full document first, then derive chunk embeddings from the full-document attention, preserving document-level context in each chunk's embedding. JinaAI's implementation shows strong results for documents where context is distributed across sections.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def create_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)
```
```python
# For structured documents with headers
from langchain.text_splitter import MarkdownHeaderTextSplitter

def create_header_aware_chunks(text: str) -> list[dict]:
    headers_to_split_on = [
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ]
    splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    splits = splitter.split_text(text)
    return [{"content": s.page_content, "metadata": s.metadata} for s in splits]
```
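Late chunking is harder to show end to end, but the core pooling step is simple. The sketch below assumes you already have per-token embeddings from a single long-context forward pass over the whole document, and uses mean pooling over each chunk's token span - one simple pooling choice; JinaAI's actual implementation differs in detail.

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray,
               chunk_spans: list[tuple[int, int]]) -> list[np.ndarray]:
    # token_embeddings: shape (num_tokens, dim), produced by ONE forward pass
    # over the full document, so each token vector already carries
    # document-level context. A chunk embedding is then just a pool (here:
    # mean) over that chunk's token span.
    return [token_embeddings[start:end].mean(axis=0) for start, end in chunk_spans]
```

The contrast with conventional chunking: here the model sees the whole document before any chunk boundary is drawn, so a chunk that says "the company" still embeds near the company actually named three sections earlier.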
Batch Processing at Scale
The naive approach - embed documents one by one as they are ingested - does not scale. At 100,000 documents, sequential single-document embedding calls are too slow and too expensive.
Use batch embedding APIs. OpenAI's embedding endpoint accepts batches of up to 2,048 inputs in a single API call. At scale this is significantly faster and cheaper than per-document calls.
```python
import asyncio
from openai import AsyncOpenAI

async def batch_embed(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    client = AsyncOpenAI()
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = await client.embeddings.create(
            input=batch,
            model="text-embedding-3-small"
        )
        # Sort by index so output order is guaranteed to match input order
        batch_embeddings = [item.embedding for item in sorted(response.data, key=lambda x: x.index)]
        all_embeddings.extend(batch_embeddings)
    return all_embeddings
```
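A loop like the one above still awaits each batch before sending the next. If your rate limits allow it, keeping several batches in flight cuts wall-clock time further. The sketch below is one way to do that, not a prescribed pattern: `bounded_embed`, the client parameter, and `max_concurrency` are all hypothetical names, and the semaphore cap is how you stay under provider rate limits.

```python
import asyncio

async def bounded_embed(client, batches: list[list[str]],
                        model: str = "text-embedding-3-small",
                        max_concurrency: int = 5) -> list[list[float]]:
    # Keep several batch requests in flight at once, capped by a semaphore
    # so the pipeline stays under the provider's rate limits.
    sem = asyncio.Semaphore(max_concurrency)

    async def embed_one(batch: list[str]) -> list[list[float]]:
        async with sem:
            resp = await client.embeddings.create(input=batch, model=model)
            return [item.embedding for item in sorted(resp.data, key=lambda x: x.index)]

    # gather preserves input order, so results line up with `batches`
    per_batch = await asyncio.gather(*(embed_one(b) for b in batches))
    return [emb for batch in per_batch for emb in batch]
```

Start with a low concurrency cap and raise it while watching for 429 responses; add retry-with-backoff before you rely on this in production.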
Dimensionality Reduction for Cost and Latency
OpenAI's text-embedding-3 models support Matryoshka Representation Learning - you can truncate embeddings to a smaller dimensionality without recomputing them. text-embedding-3-large at 256 dimensions still outperforms text-embedding-ada-002 at 1,536 dimensions, at a fraction of the storage cost.
Storage implications: 1 million vectors at 1,536 float32 dimensions = 6GB. At 256 dimensions = 1GB. For large corpora, this matters for both storage cost and query latency.
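The OpenAI API exposes a `dimensions` parameter for the text-embedding-3 models, so you can request truncated vectors directly. If you already have full-size vectors stored, truncation can also be done client-side - a minimal sketch, assuming unit-normalized similarity search:

```python
import numpy as np

def truncate_embedding(vec: list[float], dim: int = 256) -> np.ndarray:
    # Matryoshka-style truncation: keep the leading `dim` components, then
    # re-normalize to unit length so cosine / dot-product similarity
    # remains meaningful after truncation.
    v = np.asarray(vec[:dim], dtype=np.float32)
    return v / np.linalg.norm(v)
```

The re-normalization step matters: truncation shrinks the vector's norm, and skipping it silently distorts dot-product scores against full-length neighbors.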
Embedding Refresh Strategy
Your documents change. Embedding models improve. You will eventually need to re-embed your corpus.
Design for this from the start:
- Store the embedding model name and version alongside each embedded chunk in your metadata
- Build a pipeline that can re-embed a subset of documents (by date, by document type, by model version)
- When a new embedding model is released, benchmark it against your current model on a held-out sample before committing to a full re-embed
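One way to make the first two points concrete is to carry the model metadata in the stored record itself. The record shape and function names below are hypothetical, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class EmbeddedChunk:
    chunk_id: str
    text: str                  # raw text stored alongside the vector
    embedding: list[float]
    embedding_model: str       # e.g. "text-embedding-3-small"
    embedding_dim: int
    embedded_at: datetime

def needs_reembedding(chunks: list[EmbeddedChunk], current_model: str) -> list[EmbeddedChunk]:
    # Select every chunk embedded with anything other than the current model -
    # the unit of work for a partial re-embed pipeline.
    return [c for c in chunks if c.embedding_model != current_model]
```

With the model name on every record, a partial re-embed is a filter plus the batch pipeline above, rather than a full-corpus rebuild.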
Re-embedding a 100,000 document corpus with OpenAI text-embedding-3-small costs approximately $2-5. This is not prohibitive - but it requires a working pipeline and downtime planning for the vector index transition.
Common Mistakes
- Not storing raw text alongside embeddings. You will need the original text for reranking, for displaying results, and for re-embedding. Always store it.
- Embedding queries and documents with different models. Query embeddings and document embeddings must come from the same model. Mixing models produces meaningless similarity scores.
- Skipping evaluation of retrieval quality. Embed your corpus, then test 50 representative queries and manually check whether the retrieved chunks are actually relevant. Most teams skip this and discover the problem in production.
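The retrieval check in the last bullet does not need a framework: a labeled query set and a hit-rate metric are enough to catch gross failures. A minimal sketch (all names hypothetical):

```python
def recall_at_k(retrieved: dict[str, list[str]],
                relevant: dict[str, set[str]], k: int = 5) -> float:
    # Fraction of queries for which at least one known-relevant chunk id
    # appears in the top-k retrieved chunk ids.
    hits = sum(1 for q, ids in retrieved.items() if set(ids[:k]) & relevant[q])
    return hits / len(retrieved)
```

Run it over those 50 representative queries before and after any chunking or model change; a drop here is far cheaper to catch than the same regression reported by users.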
The takeaway
Get your chunking right before worrying about your embedding model. Test retrieval quality on representative queries before building downstream features on top of it. Use batch APIs from the start - it is a one-line change that dramatically improves throughput and cost. Plan for re-embedding by storing model metadata with every chunk. The embedding pipeline is the foundation of your RAG system; investing a few extra hours here saves weeks of debugging later.