I've seen AI features kill products not because they didn't work, but because they worked at a price point that made the unit economics impossible. At HCLTech, an early prototype of a clinical documentation assistant was generating $0.18 per patient encounter in inference costs - $3.6M annually at projected volume. Understanding and attacking each cost lever got that number down to $0.03.

Token Economics Fundamentals

LLM APIs charge per token (roughly 0.75 English words each). Output tokens typically cost 3-5x more than input tokens because they must be generated sequentially. Approximate pricing as of 2024:

  • GPT-4o: $5/M input, $15/M output
  • GPT-4o-mini: $0.15/M input, $0.60/M output (33x cheaper than GPT-4o)
  • Claude 3.5 Sonnet: $3/M input, $15/M output
  • Claude 3 Haiku: $0.25/M input, $1.25/M output
  • Gemini 1.5 Flash: $0.075/M input, $0.30/M output

def estimate_cost(input_tokens: int, output_tokens: int, model: str = 'gpt-4o') -> float:
    pricing = {
        'gpt-4o': (5.00, 15.00),
        'gpt-4o-mini': (0.15, 0.60),
        'claude-3-5-sonnet': (3.00, 15.00),
        'gemini-1.5-flash': (0.075, 0.30),
    }
    ip, op = pricing[model]
    return (input_tokens / 1_000_000) * ip + (output_tokens / 1_000_000) * op

# 10K docs, 2K input + 500 output tokens each:
per_doc_gpt4o = estimate_cost(2000, 500, 'gpt-4o')     # $0.0175
per_doc_mini  = estimate_cost(2000, 500, 'gpt-4o-mini') # $0.0006
print(f'GPT-4o: ${per_doc_gpt4o * 10000:.2f} for 10K docs')
print(f'Mini:   ${per_doc_mini * 10000:.2f} for 10K docs')

Lever 1: Model Selection and Routing

Switching from GPT-4o to GPT-4o-mini costs 97% less. GPT-4o-mini excels at: classification, structured extraction, simple Q&A over context, summarization, first-pass triage. GPT-4o is worth the premium for: complex multi-step reasoning, code generation, nuanced judgment tasks.

def route_query(query: str, context_length: int) -> str:
    if context_length > 100_000:
        return 'claude-3-5-sonnet'  # Large context
    if len(query.split()) < 20 and context_length < 2000:
        return 'gpt-4o-mini'        # Simple, short query
    return 'gpt-4o'
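
Even a rough traffic split shows why routing pays off. Using the per-call figures from the estimator above, and an assumed (purely illustrative) 80/20 simple-to-complex mix:

```python
# Per-call costs from the estimator above (2K input + 500 output tokens).
cost_mini, cost_4o = 0.0006, 0.0175

# Assumed mix: 80% of traffic is simple enough for the cheap tier.
# The split is illustrative; measure your own routing hit rate.
blended = 0.8 * cost_mini + 0.2 * cost_4o
print(f'Blended: ${blended:.4f}/call vs ${cost_4o:.4f}/call all-GPT-4o')
```

Routing 80% of traffic cuts the blended cost by roughly 77% even though every hard query still goes to the frontier model.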

Lever 2: Prompt Optimization

A 2,000-token system prompt prepended to 1M calls/day is 2B input tokens daily: roughly $300/day on GPT-4o-mini or $10,000/day on GPT-4o at the rates above. Compress prompts without losing quality using gist compression or LLMLingua (Microsoft's open-source library that reduces prompt length by 3-20x).

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()  # downloads a small compression model on first use
compressed = llm_lingua.compress_prompt(
    original_prompt,       # your long context (placeholder variable)
    instruction='',
    question=user_query,   # the query the compressed context must still answer (placeholder)
    target_token=400,
)
print(compressed['compressed_prompt'])  # the shortened prompt to send downstream
print(compressed['ratio'])              # reported compression ratio, e.g. '5.0x'

Lever 3: Caching

Caching is the highest-ROI cost lever. Exact-match caching: hash the prompt, return the cached response if seen before. Semantic caching: embed the query and check for semantically similar cached queries (20-40% hit rates are common). OpenAI prompt caching: repeated prompt prefixes of 1,024+ tokens are automatically billed at a 50% discount.

import hashlib, json, redis

cache = redis.Redis(host='localhost', port=6379)

def cached_llm_call(prompt: str, model: str, ttl: int = 3600) -> str:
    key = hashlib.sha256(f'{model}:{prompt}'.encode()).hexdigest()
    cached = cache.get(key)
    if cached:
        return json.loads(cached)['response']
    response = call_llm(prompt, model)  # your LLM wrapper (not shown)
    cache.setex(key, ttl, json.dumps({'response': response}))
    return response
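
The exact-match cache above misses paraphrases. A semantic cache closes that gap by comparing query embeddings. A minimal in-memory sketch, with a toy bag-of-words embedder standing in for a real embedding model (the class name, threshold, and embedder are all illustrative):

```python
import numpy as np

class SemanticCache:
    """In-memory semantic cache: return a stored response when a new query
    embeds close to one we have already answered."""

    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed          # callable: str -> np.ndarray
        self.threshold = threshold  # cosine-similarity cutoff
        self.vectors, self.responses = [], []

    def get(self, query: str):
        if not self.vectors:
            return None
        q = self.embed(query)
        mat = np.stack(self.vectors)
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-9)
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def put(self, query: str, response: str):
        self.vectors.append(self.embed(query))
        self.responses.append(response)

# Toy bag-of-words embedder so the sketch runs standalone;
# production code would use a real embedding model.
def toy_embed(text: str) -> np.ndarray:
    v = np.zeros(64)
    for tok in text.lower().split():
        v[hash(tok) % 64] += 1.0
    return v

sem_cache = SemanticCache(toy_embed)
sem_cache.put('what is your refund policy', 'Refunds are issued within 30 days.')
print(sem_cache.get('what is your refund policy'))  # identical query -> cache hit
```

In production you would back the vector store with Redis, FAISS, or a vector database rather than a Python list, and tune the threshold against a sample of real paraphrase pairs: too low and users get wrong answers, too high and the hit rate collapses.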

Lever 4: Batching

OpenAI's Batch API processes requests asynchronously with 24-hour turnaround at half price. Appropriate for: nightly document processing, bulk classification, embedding generation, scheduled reports.

from openai import OpenAI
import json

client = OpenAI()
batch_requests = [
    {
        'custom_id': f'doc_{i}',
        'method': 'POST',
        'url': '/v1/chat/completions',
        'body': {'model': 'gpt-4o-mini',
                 'messages': [{'role': 'user', 'content': text}],
                 'max_tokens': 500}
    }
    for i, text in enumerate(documents)
]
with open('batch_input.jsonl', 'w') as f:
    for req in batch_requests:
        f.write(json.dumps(req) + '\n')
batch_file = client.files.create(file=open('batch_input.jsonl', 'rb'), purpose='batch')
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint='/v1/chat/completions',
    completion_window='24h'
)  # 50% cost savings
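
Once the batch finishes (poll `client.batches.retrieve(batch.id)` until its status is `completed`, then download `client.files.content(batch.output_file_id)`), the results come back as JSONL keyed by `custom_id`. A sketch of parsing that file, using a sample line shaped like the Batch API's output (simplified for illustration):

```python
import json

# One line of batch output, shaped like the Batch API's JSONL (simplified).
sample_lines = [
    json.dumps({
        'custom_id': 'doc_0',
        'response': {
            'status_code': 200,
            'body': {'choices': [{'message': {'content': 'Summary of doc 0'}}]},
        },
        'error': None,
    }),
]

def parse_batch_output(lines):
    """Map custom_id -> completion text, skipping failed requests."""
    results = {}
    for line in lines:
        rec = json.loads(line)
        if rec.get('error') is None and rec['response']['status_code'] == 200:
            content = rec['response']['body']['choices'][0]['message']['content']
            results[rec['custom_id']] = content
    return results

print(parse_batch_output(sample_lines))  # {'doc_0': 'Summary of doc 0'}
```

Results are not guaranteed to come back in submission order, which is why each request carries a `custom_id` for re-joining against your source documents.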

Lever 5: Model Distillation

Distillation trains a smaller, cheaper model to mimic a larger model on your specific task. Use the large model to generate training data, then fine-tune a small model on it. This is the nuclear option - significant upfront investment, but can achieve 90%+ of frontier quality at 1-5% of the inference cost.

At HCLTech, we distilled a clinical entity extraction model from GPT-4o onto a fine-tuned Mistral 7B. The fine-tuned Mistral achieved 97% of GPT-4o's F1 score at approximately 3% of the per-token cost.
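
The mechanical core of distillation is just dataset construction: label your documents with the expensive teacher, then hand the labeled pairs to a fine-tuning job. A hedged sketch with a stub teacher (the prompt wording, file name, and JSONL shape are illustrative; match them to your fine-tuning provider's format):

```python
import json

def build_distillation_set(documents, teacher, out_path='distill_train.jsonl'):
    """Label documents with the expensive teacher model and write them in the
    chat-style JSONL format that fine-tuning endpoints commonly accept."""
    with open(out_path, 'w') as f:
        for doc in documents:
            record = {'messages': [
                {'role': 'user', 'content': f'Extract clinical entities:\n{doc}'},
                {'role': 'assistant', 'content': teacher(doc)},
            ]}
            f.write(json.dumps(record) + '\n')
    return out_path

# Stub teacher for illustration; in practice this calls the frontier model.
path = build_distillation_set(
    ['Patient reports intermittent chest pain.'],
    teacher=lambda d: '{"symptoms": ["chest pain"]}',
)
```

Budget for a human review pass over a sample of the teacher's labels before fine-tuning: the student model will faithfully reproduce the teacher's systematic errors.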

Putting It Together

  1. Semantic cache hit? Return cached. Cost: ~$0.
  2. Simple/short query? Route to cheap model. Cost: 30-50x cheaper.
  3. Non-real-time? Use batch API. Cost: 50% discount.
  4. High-volume narrow task? Use distilled fine-tuned model. Cost: 95%+ cheaper.
  5. Complex query requiring frontier capability? Use GPT-4o/Claude 3.5.
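
The cascade above can be sketched as a single dispatch function. The model names, thresholds, and predicates are illustrative stand-ins for real routing signals:

```python
def handle_request(query, context_len, cache_lookup, batchable=False, narrow_task=False):
    """Apply the cost levers in order of savings."""
    if cache_lookup(query) is not None:  # 1. cache hit (exact or semantic)
        return 'cache'
    if narrow_task:                      # 4. high-volume narrow task
        return 'distilled-mistral-7b'
    if batchable:                        # 3. non-real-time work at half price
        return 'batch:gpt-4o-mini'
    if len(query.split()) < 20 and context_len < 2000:
        return 'gpt-4o-mini'             # 2. simple, short query
    return 'gpt-4o'                      # 5. frontier capability required

no_hit = lambda q: None
print(handle_request('Classify this note', 500, no_hit))              # gpt-4o-mini
print(handle_request('Long complex question ' * 10, 50_000, no_hit))  # gpt-4o
```

The ordering matters: cheaper checks run first, so the frontier model only sees traffic that every other lever has declined.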

Most production AI features can achieve 70-90% cost reduction from their naive implementation without meaningful quality degradation. Treat inference cost as a first-class engineering constraint, not an afterthought.
