The first time I showed a clinical team GPT-4V analyzing a chest X-ray, the reaction was visceral. Not because it was perfect - it wasn't - but because it was describing findings in clinical language, noting ambiguous regions. In two minutes, a model was doing something that had required a radiologist for seventy years. That's the multimodal moment we're in.
The Multimodal Space
- Vision-language models (VLMs): GPT-4o, Claude 3 (all tiers), Gemini 1.5, LLaMA 3.2 Vision, PaliGemma.
- Audio models: Whisper (speech-to-text), ElevenLabs/Bark (text-to-speech), Gemini 1.5 (native audio understanding).
- Video models: Gemini 1.5 Pro (video understanding), RunwayML, Pika Labs (generation).
- Document understanding: Amazon Textract, Google Document AI, Donut, LayoutLMv3.
How Vision-Language Models Work
Modern VLMs divide images into patches, encode each with a vision encoder (ViT), then project those patch embeddings into the same space as text tokens. The language model processes a combined sequence of image patch embeddings and text token embeddings.
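The project-then-concatenate idea can be sketched with toy numpy arrays (all dimensions here are illustrative, not any real model's sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions -- illustrative only, not any real model's sizes.
n_patches, d_vision, d_model, n_text = 16, 768, 1024, 8

patch_embeddings = rng.normal(size=(n_patches, d_vision))   # from the ViT encoder
projection = rng.normal(size=(d_vision, d_model))           # learned during training
text_embeddings = rng.normal(size=(n_text, d_model))        # from the token embedder

# Project image patches into the language model's embedding space,
# then concatenate with text token embeddings into one sequence.
projected = patch_embeddings @ projection
sequence = np.concatenate([projected, text_embeddings], axis=0)
print(sequence.shape)  # (24, 1024)
```

From the language model's perspective, the 16 projected patches are just 16 more positions in the input sequence.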
GPT-4V uses high-resolution tiling: large images are downscaled and divided into 512x512 tiles, each processed independently. This is why GPT-4V can read fine print in an image - it's examining tiled sub-images. You're also paying per tile, so a large image at high detail costs several times more than a small one. Downscale images to the minimum resolution needed for your task.
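The tiling arithmetic can be sketched as a cost estimator. The constants below (fit within 2048x2048, shortest side capped at 768, 85 base tokens plus 170 per tile) follow OpenAI's published vision pricing at the time of writing and may change - treat them as illustrative:

```python
import math

def estimate_high_detail_tokens(width: int, height: int) -> int:
    """Rough token estimate for a high-detail image, following the tiling
    scheme in OpenAI's vision pricing docs: fit within 2048x2048, scale the
    shortest side to at most 768, then bill 85 base tokens plus 170 per
    512x512 tile. Constants may change -- treat as illustrative."""
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(estimate_high_detail_tokens(640, 480))  # 2 tiles -> 425 tokens
```

Running this over your typical input sizes before you pick a `detail` setting makes the cost trade-off concrete.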
from openai import OpenAI
import base64
from PIL import Image
import io

client = OpenAI()

def analyze_image(image_path: str, prompt: str, detail: str = 'high') -> str:
    img = Image.open(image_path)
    if max(img.size) > 1024 and detail == 'auto':
        img.thumbnail((1024, 1024))  # Reduce cost
    buffer = io.BytesIO()
    img.save(buffer, format='PNG')
    img_b64 = base64.b64encode(buffer.getvalue()).decode()

    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{
            'role': 'user',
            'content': [
                {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{img_b64}', 'detail': detail}},
                {'type': 'text', 'text': prompt}
            ]
        }],
        max_tokens=1000
    )
    return response.choices[0].message.content
Document Understanding: The High-Value Enterprise Use Case
Document understanding - extracting structured data from PDFs, invoices, contracts, forms - is one of the highest-ROI multimodal applications in enterprise settings. Traditional approaches required custom OCR pipelines, rule-based extraction, and constant maintenance as document formats changed. Modern VLMs handle complex table extraction, merged cells, and rotated headers far better than text-only extraction.
invoice_schema = """
Extract the following fields as JSON:
- invoice_number
- vendor_name
- invoice_date
- line_items: list of {description, quantity, unit_price, total}
- total_amount
If a field is not present, use null. Return only valid JSON.
"""
import json

# Render the PDF page to an image first (e.g. with pdf2image) --
# analyze_image opens the file with PIL, which cannot read a PDF directly.
result = analyze_image('invoice_page1.png', invoice_schema)
structured_data = json.loads(result)
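In practice, "return only valid JSON" is honored most but not all of the time - models sometimes wrap the output in markdown fences anyway. A defensive parser (a sketch; the required field names match the invoice schema above) is more robust than a bare `json.loads`:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse JSON from a model response, tolerating markdown code fences.
    Defensive sketch: models sometimes wrap JSON in ```json fences despite
    'return only valid JSON' instructions."""
    match = re.search(r'```(?:json)?\s*(.*?)```', raw, re.DOTALL)
    if match:
        raw = match.group(1)
    data = json.loads(raw)
    # Validate that required fields exist (null values are allowed)
    required = {'invoice_number', 'vendor_name', 'invoice_date',
                'line_items', 'total_amount'}
    missing = required - data.keys()
    if missing:
        raise ValueError(f'missing fields: {sorted(missing)}')
    return data
```

Failing loudly on missing fields here is deliberate: a silent null propagating into an accounts-payable system is worse than a retry.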
At HCLTech, we used this approach to extract structured data from clinical trial protocols - 200-page PDFs with complex tables, nested eligibility criteria, and cross-references. The multimodal approach outperformed text-only extraction significantly on table handling.
Medical Imaging: The Opportunity and the Reality Check
General-purpose VLMs are not clinical-grade tools. GPT-4V on chest X-ray findings performs comparably to a first-year resident in some studies, but significantly underperforms for subtle findings or rare conditions. Never deploy a general-purpose VLM for clinical decision-making without extensive validation.
Specialized medical imaging models are far better for clinical tasks: Med-PaLM M (Google) and BioViL-T (Microsoft) dramatically outperform general VLMs on their target tasks.
The immediate safe opportunity is documentation and workflow, not diagnosis. Using a VLM to generate draft radiology report text for a radiologist to review is a legitimate, valuable workflow. The radiologist remains the decision-maker; AI is a documentation assistant.
Audio: The Underutilized Modality
Whisper (OpenAI) achieves near-human accuracy on transcription across 99 languages, handles domain vocabulary and accents remarkably well, and runs locally. At HCLTech, ambient clinical documentation - recording the doctor-patient encounter and auto-generating structured clinical notes - is one of the highest-ROI AI applications we've built.
import whisper

model = whisper.load_model('large-v3')
result = model.transcribe(
    'clinical_encounter.mp3',
    language='en',
    initial_prompt='Medical encounter between physician and patient. Use standard medical terminology.'
)
print(result['text'])
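Whisper also returns per-segment timestamps in `result['segments']`, which helps turn a raw transcript into something closer to an encounter note. A sketch that groups segments into paragraphs on long pauses - a crude proxy for speaker turns; the 2-second threshold is an assumption to tune:

```python
def format_transcript(result: dict, gap_threshold: float = 2.0) -> str:
    """Group Whisper segments into paragraphs, starting a new paragraph
    whenever the pause between segments exceeds gap_threshold seconds --
    a rough proxy for speaker turns in a two-party encounter."""
    paragraphs, current, last_end = [], [], 0.0
    for seg in result['segments']:
        if current and seg['start'] - last_end > gap_threshold:
            paragraphs.append(' '.join(current))
            current = []
        current.append(seg['text'].strip())
        last_end = seg['end']
    if current:
        paragraphs.append(' '.join(current))
    return '\n\n'.join(paragraphs)
```

For true speaker attribution you'd pair this with a diarization model; pause-based chunking is just a cheap first pass.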
Gemini 1.5 processes audio natively - understanding tone, pace, sentiment, and multiple speakers. This opens applications like customer call analysis, meeting understanding, and educational speaking feedback.
Video Understanding
Gemini 1.5 Pro can process up to 1 hour of video (frames sampled at 1fps) with its 1M context window. Applications: surgical procedure documentation, safety monitoring in manufacturing, training content extraction, sports performance analysis. Cost is significant at scale - sample at lower frame rates unless temporal resolution is critical.
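If you pre-sample frames yourself before upload, the index arithmetic is simple. A sketch (the function name and signature are illustrative, not any SDK's API):

```python
def sample_frame_indices(duration_s: float, native_fps: float,
                         sample_fps: float = 1.0) -> list[int]:
    """Frame indices to extract when downsampling a video to sample_fps.
    e.g. a 10s clip at 30fps sampled at 1fps -> frames 0, 30, 60, ..., 270.
    Illustrative helper -- pair with your decoder of choice (ffmpeg, OpenCV)."""
    step = native_fps / sample_fps
    total_frames = int(duration_s * native_fps)
    return [int(i * step) for i in range(int(total_frames / step))]
```

Halving the sample rate roughly halves the token cost, so pick the lowest rate that still captures the events you care about.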
Practical Considerations
- Image quality gates matter. Blurry, poorly lit, or incorrectly rotated images degrade performance severely. Implement quality checks before sending to the model.
- Multimodal models hallucinate about images. For precision-sensitive use cases, pair VLM extraction with validation - cross-check extracted numbers against other signals.
- Test with actual user artifact types. Camera phone photos have very different characteristics than scanner output. Your demo images are not your production images.
- Consider specialized models for specialized domains. A dermatology-focused model will outperform GPT-4V for dermatology classification.
Multimodal AI is where a lot of the most interesting product work is happening right now. The teams building competitive moats are the ones going beyond text - tackling the documents, images, audio, and video that previous AI generations simply couldn't touch.