A model that scores 90% on MMLU can fail 40% of your production queries. Benchmark performance and real-world performance are weakly correlated because benchmarks measure general capability, not performance on your specific task with your specific data and your specific failure modes. I have made this mistake - choosing a model based on published benchmarks and discovering the quality gap in production.
Here is how to build evaluations that actually tell you whether your system is working.
Why Standard Benchmarks Are Insufficient
MMLU, HellaSwag, HumanEval, and similar benchmarks are useful for comparing base model capabilities. They are not useful for predicting whether your RAG system will give a nurse accurate medication information, or whether your coding assistant will generate working code for your codebase's patterns.
The core problem: your production queries are not in the benchmark dataset. Your users ask questions in ways benchmarks do not model. Your failure modes - hallucinations about your specific domain, format errors in your specific output schema, edge cases in your data - are invisible to general benchmarks.
Level 1: Task-Specific Unit Tests
Start with deterministic tests. Build a dataset of input-output pairs where the correct answer is unambiguous.
For a clinical coding system: 200 clinical notes with verified ICD-10 codes. For a SQL generator: 100 natural language queries with verified SQL. For a document classifier: 500 documents with verified labels.
This is your regression suite. Every model change, every prompt change, every RAG configuration change gets run against it. If accuracy drops more than 2%, you have a regression.
```python
from openai import OpenAI

def run_eval(test_cases: list, system_prompt: str, model: str = "gpt-4o") -> dict:
    """Run every test case through the model and report accuracy plus failures."""
    client = OpenAI()
    results = {"correct": 0, "total": len(test_cases), "failures": []}
    for case in test_cases:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": case["input"]},
            ],
        )
        output = response.choices[0].message.content.strip()
        # Loose match: counts as correct if the expected answer appears
        # anywhere in the output, case-insensitively.
        if case["expected"].lower() in output.lower():
            results["correct"] += 1
        else:
            results["failures"].append({
                "input": case["input"],
                "expected": case["expected"],
                "got": output,
            })
    results["accuracy"] = results["correct"] / results["total"]
    return results
```
Level 2: LLM-as-Judge
For outputs where correctness is not binary - summaries, explanations, recommendations - use an LLM to evaluate. This is the industry-standard approach for evaluating open-ended generation.
The LLM-as-judge pattern: provide the original question, the reference answer (if you have one), and the generated answer to a judge model (usually GPT-4o or Claude). Ask it to score on specific dimensions: accuracy, completeness, relevance, hallucination.
Critical implementation notes:
- Use a stronger model than the one being evaluated, or one from a different model family, to reduce self-preference bias toward outputs that resemble the judge's own style
- Use rubric-based scoring (1-5 on specific criteria) rather than holistic scoring - it is more consistent and more interpretable
- Always include a chain-of-thought reasoning step in the judge prompt before the score - it improves score quality
- Validate your judge against human ratings on a sample. If agreement between the judge and human raters is below 70%, the judge is unreliable
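The pattern above can be sketched as a rubric prompt plus a score parser. This is a minimal illustration, not a production judge: the `JUDGE_PROMPT` template, the four dimensions, and the output format are all assumptions matching the notes above, and the actual judge call (to GPT-4o or Claude) is omitted.

```python
import re

# Hypothetical rubric prompt; asks for chain-of-thought reasoning first,
# then one line per dimension in a machine-parseable format.
JUDGE_PROMPT = """You are grading a generated answer against a reference.

Question: {question}
Reference answer: {reference}
Generated answer: {answer}

First, reason step by step about accuracy, completeness, relevance,
and hallucination. Then output one line per dimension in the form
`dimension: score` with an integer score from 1 to 5, e.g. `accuracy: 4`."""

SCORE_RE = re.compile(
    r"^(accuracy|completeness|relevance|hallucination):\s*([1-5])\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def parse_judge_scores(judge_output: str) -> dict:
    """Extract per-dimension 1-5 scores from the judge model's response.

    The reasoning text before the scores is ignored; only lines matching
    `dimension: score` are picked up.
    """
    return {dim.lower(): int(score) for dim, score in SCORE_RE.findall(judge_output)}
```

Parsing structured per-dimension scores rather than a single holistic number is what makes the judge's output trackable over time and debuggable when it disagrees with humans.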
Level 3: RAG-Specific Metrics
For RAG systems, you need to evaluate retrieval and generation both separately and together. The RAGAS framework provides four key metrics:
- Faithfulness: Is the answer grounded in the retrieved context? Measures hallucination rate.
- Answer Relevancy: Does the answer actually address the question?
- Context Precision: Are the retrieved chunks relevant to the question?
- Context Recall: Were all necessary facts retrievable from your corpus?
In practice, I track context precision first - if you are retrieving bad chunks, everything downstream fails. We found at HCLTech that improving context precision from 60% to 85% had more impact on final answer quality than any prompt changes.
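RAGAS computes these metrics with LLM calls, but the underlying arithmetic for context precision is easy to see with precomputed relevance labels. A rough sketch of the rank-weighted form: for each rank that holds a relevant chunk, take precision@k at that rank, then average. The relevance labels here stand in for the judgments RAGAS would obtain from an LLM.

```python
def context_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks, ordered by rank.

    `relevance[k-1]` says whether the chunk at rank k was relevant.
    Averages precision@k at each rank holding a relevant chunk, so
    relevant chunks ranked higher contribute more.
    """
    precisions = []
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

For instance, retrieving a relevant chunk at rank 1 and rank 3 with an irrelevant chunk at rank 2 scores (1/1 + 2/3) / 2 ≈ 0.83, while burying the only relevant chunk at the bottom scores much lower.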
Level 4: Human Evaluation with a Structured Protocol
No automated evaluation fully replaces human judgment. For high-stakes applications - medical, legal, financial - you need human review in your evaluation pipeline.
Practical structure for human eval:
- Sample 50-100 production queries weekly (stratified by query type if possible)
- Have subject matter experts rate outputs on 3-4 dimensions (accuracy, completeness, safety, format)
- Track these scores over time - they are your ground truth signal
- Use disagreements between human scores and automated scores to improve your automated eval
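The weekly stratified sample from the protocol above can be drawn with a few lines of code. A minimal sketch, assuming each logged query carries a `type` field (the field name is illustrative); a fixed seed keeps the weekly draw reproducible for audit.

```python
import random
from collections import defaultdict

def stratified_sample(queries: list[dict], per_type: int, seed: int = 0) -> list[dict]:
    """Sample up to `per_type` queries from each query type.

    Stratifying by type ensures rare-but-important query categories
    are not drowned out by the most common ones.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for q in queries:
        by_type[q["type"]].append(q)
    sample = []
    for qs in by_type.values():
        rng.shuffle(qs)
        sample.extend(qs[:per_type])
    return sample
```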
Building Your Evaluation Pipeline
The evaluation tooling space has matured quickly. Frameworks worth knowing:
- RAGAS: Best for RAG-specific evaluation. Open source, integrates with LangChain and LlamaIndex.
- Promptfoo: CLI-based, good for regression testing and model comparison. Easy CI/CD integration.
- LangSmith: Good observability and evaluation if you are on the LangChain stack.
- Braintrust: Full evaluation platform with dataset management, human review workflows, and CI integration.
My take
Build evaluation before you build the product. Start with 50-100 representative test cases drawn from your actual use case, not synthetic examples. Add automated evaluation early so you can iterate quickly. Layer in human evaluation for the failure modes that matter most. A system you cannot measure is a system you cannot improve - and in production AI, what you cannot measure will eventually break in ways that surprise you.