Most AI teams measure accuracy on a test set and ship. This is how you end up with a model that scores 94% on benchmarks and fails catastrophically in production. Real evaluation covers technical metrics, business metrics, human judgment, and continuous monitoring.
The Evaluation Stack
- Technical metrics - Does the model perform well on your held-out dataset?
- Business metrics - Does performance translate to outcomes you care about?
- Human evaluation - Do humans judge the outputs as good?
- Production monitoring - Does quality hold in the wild, over time?
Technical Metrics: Beyond Accuracy
If 95% of your examples are the negative class, a model that always predicts "negative" achieves 95% accuracy while being completely useless.
- Precision: Of all the times the model said yes, what fraction was actually yes? (TP / (TP + FP))
- Recall: Of all actual yes cases, what fraction did the model catch? (TP / (TP + FN))
- F1: Harmonic mean of precision and recall.
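The imbalanced-class trap is easy to reproduce. A minimal sketch in plain Python (the `accuracy` and `recall` helpers are illustrative, not from any library) showing a degenerate always-negative classifier:

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the labels
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    # Of all actual positives, what fraction did the model catch?
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

# 95 negatives, 5 positives -- and a model that always says "negative"
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.95 -- looks great
print(recall(y_true, y_pred))    # 0.0  -- catches zero positives
```

The accuracy number alone hides that the model never fires on the class you actually care about; recall exposes it immediately.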
The precision-recall tradeoff is a product decision, not a math problem. At HCLTech, we built a model to flag potential drug-drug interactions. A false negative could harm a patient. A false positive wastes a clinician's time. We optimized for recall. The threshold was set by clinical stakeholders, not by maximizing F1.
```python
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

def evaluate_classifier(y_true, y_pred_proba, threshold=0.5):
    # Binarize probabilities at the chosen decision threshold
    y_pred = (y_pred_proba >= threshold).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='binary'
    )
    # AUC is threshold-independent, so it uses the raw probabilities
    auc = roc_auc_score(y_true, y_pred_proba)
    print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}, AUC: {auc:.3f}")
    # Confusion-matrix cells, to make the two error types concrete
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    tn = ((y_pred == 0) & (y_true == 0)).sum()
    print(f"TP: {tp}, FP: {fp}, FN: {fn}, TN: {tn}")
```
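When recall is the priority, as in the drug-interaction example, the threshold can be set by sweeping for the strictest cutoff that still meets a recall floor agreed with stakeholders. A sketch with a hypothetical `choose_threshold` helper (plain Python, illustrative names):

```python
def choose_threshold(y_true, y_prob, min_recall=0.95):
    # Sweep candidate thresholds from high to low. Recall can only grow as
    # the threshold drops, so the first cutoff meeting the floor is the
    # strictest one -- i.e. the one that wastes the fewest clinician hours.
    for threshold in sorted(set(y_prob), reverse=True):
        y_pred = [p >= threshold for p in y_prob]
        tp = sum(t and yp for t, yp in zip(y_true, y_pred))
        fn = sum(t and not yp for t, yp in zip(y_true, y_pred))
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if recall >= min_recall:
            return threshold
    return 0.0  # no cutoff meets the floor: predict positive everywhere
```

For example, `choose_threshold([1, 1, 1, 0, 0], [0.9, 0.6, 0.3, 0.5, 0.2], min_recall=1.0)` returns `0.3`, the highest threshold that still catches every positive.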
BLEU and ROUGE: Use them for regression testing (did this code change break something?), not for absolute quality assessment. A summary with high ROUGE-L can be factually wrong or miss the key point while copying surface phrases. Use LLM-as-judge for actual quality.
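For regression-testing purposes, a self-contained ROUGE-L check is enough. Real libraries such as rouge-score offer more complete implementations; `lcs_len` and `rouge_l_f1` below are an illustrative sketch:

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    # ROUGE-L F1: harmonic mean of LCS-based precision and recall over tokens
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

A regression test then asserts that a prompt or code change did not crater overlap with known-good outputs, e.g. `assert rouge_l_f1(new_output, reference) >= 0.9 * rouge_l_f1(old_output, reference)` -- without claiming the score measures quality in any absolute sense.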
For RAG pipelines: The RAGAS framework provides automated metrics for context relevance, answer faithfulness, and answer relevance.
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset

# One evaluation example: question, generated answer, retrieved contexts,
# and a reference answer
data = {
    "question": ["What is the max dose of metformin?"],
    "answer": ["The maximum recommended dose is 2550mg/day."],
    "contexts": [["Metformin maximum dose is 2550mg daily in divided doses."]],
    "ground_truth": ["2550mg per day"],
}

results = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_relevancy],
)
print(results)
```
Business Metrics: The Layer That Actually Matters
Technical metrics are proxies. The gap between them and business outcomes is where most AI features fail:
- An email drafting assistant has an 85% user acceptance rate. But time-to-send only decreased by 8%. Users accept the draft and then spend 5 minutes editing it anyway.
- A medical coding assistant achieves 91% accuracy on the test set. But first-pass resolution rate is unchanged. The 9% of errors are concentrated in high-complexity, high-value codes.
For every AI feature, define the business metric chain upfront: technical metric, proximate behavioral metric, downstream business metric, and the explicit hypothesis connecting them.
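One lightweight way to make the chain explicit is to record it next to the feature itself. A sketch with a hypothetical `MetricChain` dataclass, filled in with illustrative targets for the email-assistant example:

```python
from dataclasses import dataclass

@dataclass
class MetricChain:
    """Illustrative record of the metric chain for one AI feature."""
    technical: str   # offline metric on held-out data
    behavioral: str  # proximate user behavior it should move
    business: str    # downstream outcome that justifies the feature
    hypothesis: str  # explicit causal link to be tested

email_assistant = MetricChain(
    technical="draft acceptance rate >= 80%",
    behavioral="median time-to-send drops 30%",
    business="emails handled per rep per day increases",
    hypothesis="accepted drafts need little editing, so sending gets faster",
)
```

Writing the hypothesis down makes the failure mode above diagnosable: if acceptance is high but time-to-send barely moves, the hypothesis, not the model, is what broke.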
Human Evaluation
Pairwise comparison over absolute rating. Ask "Which of these two responses is better?" rather than "Rate this 1-5." Pairwise comparison has higher inter-annotator agreement. It's also the basis for RLHF preference data collection.
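Pairwise judgments still have to be aggregated into something comparable across models. A minimal sketch (the `win_rates` helper is illustrative; ties counted as half a win for each side):

```python
from collections import Counter

def win_rates(judgments):
    # judgments: list of (model_a, model_b, winner) tuples,
    # where winner is "a", "b", or "tie"
    wins, games = Counter(), Counter()
    for model_a, model_b, winner in judgments:
        games[model_a] += 1
        games[model_b] += 1
        if winner == "a":
            wins[model_a] += 1
        elif winner == "b":
            wins[model_b] += 1
        else:  # a tie counts as half a win for each side
            wins[model_a] += 0.5
            wins[model_b] += 0.5
    return {m: wins[m] / games[m] for m in games}
```

For example, `win_rates([("m1", "m2", "a"), ("m1", "m2", "tie")])` gives m1 a 0.75 win rate and m2 a 0.25 win rate. More sophisticated aggregation (e.g. Bradley-Terry, as used in LLM arenas) handles intransitive preferences and uneven matchups.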
LLM-as-judge for scale. Using GPT-4 or Claude to evaluate outputs correlates with human judgments at ~0.8-0.9 for many tasks. Provide the same rubric to the LLM that you'd give human annotators.
```python
import json

from openai import OpenAI

client = OpenAI()

def llm_evaluate(question: str, answer: str, rubric: str) -> dict:
    # Give the judge model the same rubric a human annotator would get
    prompt = (
        "Evaluate this AI response using the provided rubric.\n"
        f"Question: {question}\nResponse: {answer}\nRubric: {rubric}\n"
        "Score each criterion 1-5 and provide justification.\n"
        "Return JSON with keys: factual_accuracy, completeness, conciseness, "
        "overall, justification"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        # Force valid JSON so the result can be parsed programmatically
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
A/B Testing for AI Features
- Novelty effect: Run tests for at least 2-4 weeks to let novelty decay.
- Spillover effects: In B2B, randomize at the account level, not the user level.
- Interleaving: For ranking systems, mix results from two models and track which items get clicked - more statistically efficient than traditional A/B testing.
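Team-draft interleaving, a common variant, alternates picks between the two rankings so each clicked item can be credited to one model. An illustrative sketch of the standard algorithm (function names are ours, not a library's):

```python
import random

def take_next(ranking, pos, taken):
    # First item at or after pos not already placed; returns (item, new_pos)
    while pos < len(ranking) and ranking[pos] in taken:
        pos += 1
    if pos < len(ranking):
        return ranking[pos], pos + 1
    return None, pos

def team_draft_interleave(ranking_a, ranking_b, rng=None):
    rng = rng or random.Random(0)
    merged, team = [], {}
    rankings = {"a": ranking_a, "b": ranking_b}
    pos = {"a": 0, "b": 0}
    picks = {"a": 0, "b": 0}
    while True:
        # The team with fewer picks drafts next; coin flip on ties
        if picks["a"] < picks["b"]:
            order = ["a", "b"]
        elif picks["b"] < picks["a"]:
            order = ["b", "a"]
        else:
            order = ["a", "b"] if rng.random() < 0.5 else ["b", "a"]
        placed = False
        for side in order:
            item, pos[side] = take_next(rankings[side], pos[side], team)
            if item is not None:
                merged.append(item)
                team[item] = side  # clicks on this item credit this model
                picks[side] += 1
                placed = True
                break
        if not placed:  # both rankings exhausted of fresh items
            return merged, team
```

At evaluation time you count clicks per team: if users consistently click items drafted by model B, B wins, with every query contributing signal instead of only the queries randomized into one arm.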
Production Monitoring
- Input distribution drift: Track embedding distributions over time. If average cosine similarity between new inputs and training inputs drops, flag for retraining.
- Output distribution shift: Are refusal rates changing? Average response lengths shifting?
- Implicit feedback: Thumbs up/down, regeneration rate, copy-paste rate - weak signals individually, strong in aggregate.
- Error sampling: Sample 1-2% of production outputs daily for human review.
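The embedding-drift check in the first bullet can be as simple as comparing each new input to the training-set centroid. A sketch in plain Python, with illustrative names and an illustrative 0.8 similarity floor that would need tuning per deployment:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def drift_alert(baseline_centroid, recent_embeddings, min_similarity=0.8):
    # Flag for retraining when recent inputs drift away from the
    # training-set centroid; returns (alert, average_similarity)
    avg = sum(cosine(baseline_centroid, e) for e in recent_embeddings) / len(
        recent_embeddings
    )
    return avg < min_similarity, avg
```

In production this would run on a schedule over a sliding window of recent request embeddings, with the alert feeding the same channel as the daily error-sampling review.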
The evaluation system is an ongoing investment, not a pre-launch checkbox. The best AI teams treat model evaluation with the same rigor as product analytics - continuous, automated, with human review in the loop.