When I was building the search infrastructure for a healthcare knowledge platform at HCLTech, we replaced keyword-based search with embedding-based semantic search. A query like "elevated liver enzymes" started returning results for "hepatic biomarkers" and "ALT/AST elevation" - terms that never appeared in the original query. Recall improved by 34%. This is what embeddings do: they encode meaning, not just text.

What Is an Embedding?

An embedding is a dense vector representation of text. A typical embedding model outputs a vector of 768 to 3,072 floating-point numbers. The key property: semantically similar text has geometrically similar vectors. "Dog" and "puppy" are close in vector space. "Dog" and "arbitrage" are far apart.

from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> list[float]:
    # One API call per string; batch inputs in production to cut latency
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between the two vectors, in [-1, 1]
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

e1 = embed("elevated liver enzymes")
e2 = embed("hepatic biomarkers")
e3 = embed("stock market volatility")

print(cosine_similarity(e1, e2))  # ~0.82
print(cosine_similarity(e1, e3))  # ~0.31

How Embedding Models Are Trained

Most modern text embedding models are transformer encoders, commonly BERT-family, trained with contrastive learning. Training typically uses a siamese setup: two weight-tied encoders process two pieces of text. Training pairs include positive pairs (semantically similar text) and negative pairs (semantically different text). The loss function pushes positive-pair embeddings closer together and negative pairs further apart. Over millions of training pairs, the model learns to encode semantic relationships in geometric form.
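The contrastive objective can be sketched numerically. Below is a minimal InfoNCE-style loss over a batch of (anchor, positive) embedding pairs, using every other row in the batch as an in-batch negative. This is an illustration only: a real training loop backpropagates this loss through the encoder, which the sketch omits.

```python
import numpy as np

def info_nce_loss(anchors: np.ndarray, positives: np.ndarray,
                  temperature: float = 0.05) -> float:
    """InfoNCE loss for a batch of (anchor, positive) pairs.

    Each anchor's positive is the same-index row in `positives`;
    every other row in the batch serves as an in-batch negative.
    """
    # L2-normalize so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (batch, batch) similarities
    # Softmax cross-entropy with the diagonal (true pairs) as targets
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: matched pairs point in similar directions, so loss is low
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 32))
positives = anchors + 0.1 * rng.normal(size=(8, 32))  # near-duplicates
print(info_nce_loss(anchors, positives))
```

Minimizing this loss maximizes the similarity of true pairs relative to all in-batch alternatives, which is exactly the "push positives together, negatives apart" dynamic described above.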

Matryoshka Representation Learning

OpenAI's text-embedding-3-* models use Matryoshka Representation Learning (MRL): the first N dimensions of the full embedding are themselves a valid lower-dimensional embedding. You can truncate text-embedding-3-large from 3,072 to 256 dimensions and retain ~85% of performance at 12x lower storage cost.

response = client.embeddings.create(
    model="text-embedding-3-large",
    input=text,
    dimensions=256  # Truncate from 3072 to 256
)
# Storage: 256 * 4 bytes ≈ 1 KB per vector vs ~12 KB at full size
# For 10M documents: ~10 GB vs ~123 GB
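If you already have full-size vectors stored, you can also truncate them client-side instead of re-embedding. One detail the `dimensions` parameter handles for you: a truncated vector is no longer unit-length, so re-normalize before computing cosine similarities. A minimal sketch:

```python
import numpy as np

def truncate_mrl(embedding: list[float], dims: int = 256) -> np.ndarray:
    """Keep the first `dims` dimensions of an MRL embedding
    and re-normalize to unit length."""
    v = np.asarray(embedding[:dims], dtype=np.float64)
    return v / np.linalg.norm(v)

# Stand-in for a stored 3,072-dim embedding
full = np.random.default_rng(1).normal(size=3072).tolist()
small = truncate_mrl(full, dims=256)
print(small.shape)  # (256,)
```

This only works for MRL-trained models; truncating a conventional embedding this way discards information arbitrarily rather than gracefully.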

Choosing the Right Embedding Model

The MTEB (Massive Text Embedding Benchmark) leaderboard is the standard reference. Top performers:

  • text-embedding-3-large (OpenAI): Strong general-purpose, $0.13/M tokens.
  • voyage-large-2 (Voyage AI): Often outperforms OpenAI on RAG tasks, $0.12/M tokens.
  • BAAI/bge-large-en-v1.5: Open-source, runs locally. Good for privacy-sensitive use.
  • E5-mistral-7b-instruct: Excellent for asymmetric retrieval (short query, long document).

Asymmetric vs Symmetric Retrieval

Many models are trained on symmetric pairs (similar articles) and perform poorly on asymmetric retrieval (short question, long document answer). For Q&A over documents, use models trained for asymmetric retrieval. Or use HyDE (Hypothetical Document Embeddings): generate a hypothetical answer with an LLM, then embed that answer instead of the raw question.

def hyde_embed(question: str) -> list[float]:
    # Generate a hypothetical answer: it is document-shaped, so it
    # embeds closer to real answer passages than the short question does
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a brief, factual answer to: {question}"
        }]
    )
    hypothetical_answer = response.choices[0].message.content
    return embed(hypothetical_answer)

Embeddings Beyond Search

  • Clustering: At Mamaearth, we clustered customer reviews by semantic theme, surfacing product feedback signals keyword search missed entirely.
  • Anomaly detection: Flag clinical notes that sit statistically far from a doctor's historical baseline.
  • Zero-shot classification: Embed class descriptions, then classify incoming items by nearest class.

classes = {
    "billing": embed("invoice payment charge subscription"),
    "technical": embed("bug error crash not working"),
    "account": embed("login password access account settings")
}

def classify_ticket(ticket_text: str) -> str:
    ticket_emb = embed(ticket_text)
    scores = {
        label: cosine_similarity(ticket_emb, class_emb)
        for label, class_emb in classes.items()
    }
    return max(scores, key=scores.get)
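The clustering use case can be sketched the same way. The synthetic "review embeddings" and hand-rolled k-means below are stand-ins so the example runs without API calls; in practice you would embed real reviews and use a library implementation such as scikit-learn's KMeans, picking k by silhouette score or inspection.

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, iters: int = 20) -> np.ndarray:
    """Minimal k-means over embedding vectors; returns a label per row."""
    # Farthest-point initialization: deterministic and well spread out
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each vector to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned vectors
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Stand-in for real review embeddings: two synthetic semantic "themes"
rng = np.random.default_rng(42)
center_a, center_b = rng.normal(size=64), rng.normal(size=64)
reviews = np.vstack([
    center_a + 0.1 * rng.normal(size=(50, 64)),
    center_b + 0.1 * rng.normal(size=(50, 64)),
])
labels = kmeans(reviews, k=2)  # reviews from the same theme share a label
```

The cluster labels then become the unit of analysis: count cluster sizes over time, or feed a few examples per cluster to an LLM to name the theme.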

Where Embeddings Break Down

  • Negation: "Patient has no fever" and "Patient has fever" have similar embeddings.
  • Long documents: Most models have a 512-8K token limit. Truncation silently drops content. Use chunking.
  • Domain specificity: General web-trained models produce poor embeddings for specialized domains. Consider PubMedBERT for biomedical, LegalBERT for legal.
  • Cross-lingual: Multilingual models often underperform dedicated monolingual models on individual languages.
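For the long-document limit, the standard fix is overlapping fixed-size chunks, so content near a boundary appears intact in at least one chunk. A minimal word-based sketch; real pipelines count tokens with the embedding model's own tokenizer, and the window and overlap sizes here are illustrative:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-window chunks.

    `chunk_size` and `overlap` are in words; production code should
    count tokens with the embedding model's tokenizer instead.
    """
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window reached the end of the document
    return chunks

doc = " ".join(f"word{i}" for i in range(500))
chunks = chunk_text(doc, chunk_size=200, overlap=40)
print(len(chunks))  # 3 windows: words 0-199, 160-359, 320-499
```

Each chunk is embedded and indexed separately, with metadata pointing back to the source document so retrieval hits can be deduplicated.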

Choosing the wrong embedding model early means rebuilding your entire vector index later - a painful and expensive migration. Pick deliberately, test on your domain.

