I've seen AI features kill products not because they didn't work, but because they worked at a price point that made the unit economics impossible. At HCLTech, an early prototype of a clinical documentation assistant was generating $0.18 per patient encounter in inference costs - $3.6M annually at projected volume. Understanding and attacking each cost lever got that number down to $0.03.
Token Economics Fundamentals
LLM APIs charge per token (a token is roughly 0.75 English words). Output tokens typically cost 3-5x as much as input tokens because they must be generated sequentially. Approximate 2024 list pricing:
- GPT-4o: $5/M input, $15/M output
- GPT-4o-mini: $0.15/M input, $0.60/M output (25-33x cheaper than GPT-4o)
- Claude 3.5 Sonnet: $3/M input, $15/M output
- Claude 3 Haiku: $0.25/M input, $1.25/M output
- Gemini 1.5 Flash: $0.075/M input, $0.30/M output
```python
def estimate_cost(input_tokens: int, output_tokens: int, model: str = 'gpt-4o') -> float:
    """Cost in USD, given per-million-token (input, output) prices."""
    pricing = {
        'gpt-4o': (5.00, 15.00),
        'gpt-4o-mini': (0.15, 0.60),
        'claude-3-5-sonnet': (3.00, 15.00),
        'gemini-1.5-flash': (0.075, 0.30),
    }
    ip, op = pricing[model]
    return (input_tokens / 1_000_000) * ip + (output_tokens / 1_000_000) * op

# 10K docs, 2K input + 500 output tokens each:
per_doc_gpt4o = estimate_cost(2000, 500, 'gpt-4o')       # $0.0175
per_doc_mini = estimate_cost(2000, 500, 'gpt-4o-mini')   # $0.0006
print(f'GPT-4o: ${per_doc_gpt4o * 10000:.2f} for 10K docs')  # $175.00
print(f'Mini: ${per_doc_mini * 10000:.2f} for 10K docs')     # $6.00
```
Lever 1: Model Selection and Routing
Switching from GPT-4o to GPT-4o-mini costs 97% less. GPT-4o-mini excels at: classification, structured extraction, simple Q&A over context, summarization, first-pass triage. GPT-4o is worth the premium for: complex multi-step reasoning, code generation, nuanced judgment tasks.
```python
def route_query(query: str, context_length: int) -> str:
    """Pick the cheapest model that can plausibly handle the request."""
    if context_length > 100_000:
        return 'claude-3-5-sonnet'  # large-context requests
    if len(query.split()) < 20 and context_length < 2000:
        return 'gpt-4o-mini'  # simple, short query over small context
    return 'gpt-4o'  # default to the frontier model
```
Lever 2: Prompt Optimization
A 2,000-token system prompt running 1M calls/day consumes 2B input tokens daily: roughly $300/day on GPT-4o-mini or $10,000/day on GPT-4o. Compress prompts without losing quality using gisting or LLMLingua (Microsoft's open-source library, which typically shrinks prompts 3-20x).
```python
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()
result = llm_lingua.compress_prompt(
    original_prompt,   # the long context to compress (defined elsewhere)
    instruction='',
    question=user_query,
    target_token=400,
)
print(result['compressed_prompt'])
print(f"{result['origin_tokens']} -> {result['compressed_tokens']} tokens")
```
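Prompt-prefix spend scales linearly with call volume, so the arithmetic is worth making explicit. A minimal helper (using the per-million-token input prices listed earlier):

```python
def daily_prompt_cost(prompt_tokens: int, calls_per_day: int,
                      price_per_m_input: float) -> float:
    """Daily cost of sending a fixed prompt prefix on every call."""
    return (prompt_tokens * calls_per_day / 1_000_000) * price_per_m_input

# 2,000-token system prompt at 1M calls/day = 2B input tokens/day
print(f'GPT-4o-mini: ${daily_prompt_cost(2000, 1_000_000, 0.15):,.2f}/day')  # $300.00/day
print(f'GPT-4o: ${daily_prompt_cost(2000, 1_000_000, 5.00):,.2f}/day')       # $10,000.00/day
```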
Lever 3: Caching
Caching is the highest-ROI cost lever. Exact-match caching: hash the prompt and return the cached response if it has been seen before. Semantic caching: embed the query and check for semantically similar cached queries (20-40% hit rates are common). OpenAI prompt caching: repeated prompt prefixes of 1,024+ tokens are automatically billed at a 50% discount.
```python
import hashlib
import json

import redis

cache = redis.Redis(host='localhost', port=6379)

def cached_llm_call(prompt: str, model: str, ttl: int = 3600) -> str:
    """Exact-match cache: key on a hash of (model, prompt)."""
    key = hashlib.sha256(f'{model}:{prompt}'.encode()).hexdigest()
    cached = cache.get(key)
    if cached:
        return json.loads(cached)['response']
    response = call_llm(prompt, model)  # your underlying LLM client
    cache.setex(key, ttl, json.dumps({'response': response}))
    return response
```
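A semantic cache follows the same shape. The sketch below is self-contained: a toy letter-frequency `embed` stands in for a real embedding model (in production you'd use something like `text-embedding-3-small` and a vector store), and a cosine-similarity threshold decides whether a cached answer is close enough:

```python
import math
from typing import Optional

SIM_THRESHOLD = 0.92  # tune per application: too low serves wrong answers

semantic_cache: list = []  # (embedding, cached response) pairs

def embed(text: str) -> list:
    """Toy embedding: letter-frequency vector. Swap in a real model in production."""
    vec = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def semantic_lookup(query: str) -> Optional[str]:
    """Return a cached response if a stored query is similar enough."""
    if not semantic_cache:
        return None
    q = embed(query)
    emb, resp = max(semantic_cache, key=lambda pair: cosine(q, pair[0]))
    return resp if cosine(q, emb) >= SIM_THRESHOLD else None

def semantic_store(query: str, response: str) -> None:
    semantic_cache.append((embed(query), response))
```

The threshold is the design decision that matters: set it too loose and you return stale or wrong answers, which is far worse than a cache miss.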
Lever 4: Batching
OpenAI's Batch API processes requests asynchronously with 24-hour turnaround at half price. Appropriate for: nightly document processing, bulk classification, embedding generation, scheduled reports.
```python
import json

from openai import OpenAI

client = OpenAI()

batch_requests = [
    {
        'custom_id': f'doc_{i}',
        'method': 'POST',
        'url': '/v1/chat/completions',
        'body': {
            'model': 'gpt-4o-mini',
            'messages': [{'role': 'user', 'content': text}],
            'max_tokens': 500,
        },
    }
    for i, text in enumerate(documents)
]

with open('batch_input.jsonl', 'w') as f:
    for req in batch_requests:
        f.write(json.dumps(req) + '\n')

batch_file = client.files.create(file=open('batch_input.jsonl', 'rb'), purpose='batch')
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint='/v1/chat/completions',
    completion_window='24h',
)  # 50% cost savings vs. synchronous calls
```
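When the batch finishes, the output is itself a JSONL file (fetched via `client.batches.retrieve` and `client.files.content`), with each line pairing a `custom_id` with the completion. A small parser, assuming the documented output shape, maps results back to documents:

```python
import json

def parse_batch_output(jsonl_text: str) -> dict:
    """Map custom_id -> completion text from a Batch API output file."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        body = record['response']['body']
        results[record['custom_id']] = body['choices'][0]['message']['content']
    return results
```

In production you'd also check each record's status code and the batch's `error_file_id`, since individual requests can fail while the batch as a whole succeeds.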
Lever 5: Model Distillation
Distillation trains a smaller, cheaper model to mimic a larger model on your specific task. Use the large model to generate training data, then fine-tune a small model on it. This is the nuclear option - significant upfront investment, but can achieve 90%+ of frontier quality at 1-5% of the inference cost.
At HCLTech, we distilled a clinical entity extraction model from GPT-4o onto a fine-tuned Mistral 7B. The fine-tuned Mistral achieved 97% of GPT-4o's F1 score at approximately 3% of the per-token cost.
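The data-generation half of distillation is mechanical: run the teacher over representative inputs, then serialize the (input, teacher output) pairs for fine-tuning. A sketch using the common chat-messages JSONL layout (field names follow OpenAI's fine-tuning format; a Mistral training stack would want its own chat template):

```python
import json

def build_distillation_file(pairs, system_prompt: str, path: str) -> int:
    """Write (input, teacher_output) pairs as chat-format JSONL for fine-tuning."""
    with open(path, 'w') as f:
        for user_input, teacher_output in pairs:
            example = {
                'messages': [
                    {'role': 'system', 'content': system_prompt},
                    {'role': 'user', 'content': user_input},
                    {'role': 'assistant', 'content': teacher_output},
                ]
            }
            f.write(json.dumps(example) + '\n')
    return len(pairs)
```

Quality-filter the teacher outputs before training: the student faithfully learns the teacher's mistakes too.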
Putting It Together
- Semantic cache hit? Return cached. Cost: ~$0.
- Simple/short query? Route to cheap model. Cost: 30-50x cheaper.
- Non-real-time? Use batch API. Cost: 50% discount.
- High-volume narrow task? Use distilled fine-tuned model. Cost: 95%+ cheaper.
- Complex query requiring frontier capability? Use GPT-4o/Claude 3.5.
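The levers compound. With hypothetical traffic shares (the multipliers below are illustrative, anchored to the per-doc GPT-4o cost from the token-economics example), the blended per-call cost falls out directly:

```python
def blended_cost(base_cost: float, tiers) -> float:
    """Blend per-call cost across routing tiers.

    tiers: (fraction of traffic, cost multiplier vs. the naive frontier-model call).
    """
    assert abs(sum(share for share, _ in tiers) - 1.0) < 1e-9, 'shares must sum to 1'
    return base_cost * sum(share * mult for share, mult in tiers)

# Hypothetical mix: 30% cache hits (free), 40% routed to gpt-4o-mini (~3% of cost),
# 20% batched mini (~1.5%), 10% left on GPT-4o.
naive = 0.0175  # per-doc GPT-4o cost from the earlier example
optimized = blended_cost(naive, [(0.30, 0.0), (0.40, 0.03), (0.20, 0.015), (0.10, 1.0)])
print(f'{optimized:.5f} vs {naive:.5f}')  # 0.00201 vs 0.01750 -> ~88% cheaper
```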
Most production AI features can achieve 70-90% cost reduction from their naive implementation without meaningful quality degradation. Treat inference cost as a first-class engineering constraint, not an afterthought.