The first time I showed a clinical team GPT-4V analyzing a chest X-ray, the reaction was visceral. Not because it was perfect - it wasn't - but because it was describing findings in clinical language, noting ambiguous regions. In two minutes, a model was doing something that had required a radiologist for seventy years. That's the multimodal moment we're in.
The Multimodal Space
- Vision-language models (VLMs): GPT-4o, Claude 3 (all tiers), Gemini 1.5, LLaMA 3.2 Vision, PaliGemma.
- Audio models: Whisper (speech-to-text), ElevenLabs/Bark (text-to-speech), Gemini 1.5 (native audio understanding).
- Video models: Gemini 1.5 Pro (video understanding), RunwayML, Pika Labs (generation).
- Document understanding: Amazon Textract, Google Document AI, Donut, LayoutLMv3.
How Vision-Language Models Work
Modern VLMs divide images into patches, encode each with a vision encoder (ViT), then project those patch embeddings into the same space as text tokens. The language model processes a combined sequence of image patch embeddings and text token embeddings.
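The project-then-concatenate idea can be sketched with toy numpy arrays (all dimensions here are illustrative, not any real model's sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions -- illustrative only, not any real model's sizes.
n_patches, d_vision, d_model, n_text = 16, 768, 1024, 8

patch_embeddings = rng.normal(size=(n_patches, d_vision))   # from the ViT encoder
projection = rng.normal(size=(d_vision, d_model))           # learned during training
text_embeddings = rng.normal(size=(n_text, d_model))        # from the token embedder

# Project image patches into the language model's embedding space,
# then concatenate with text token embeddings into one sequence.
projected = patch_embeddings @ projection
sequence = np.concatenate([projected, text_embeddings], axis=0)
print(sequence.shape)  # (24, 1024)
```

From the language model's perspective, the 16 projected patches are just 16 more positions in the input sequence.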
GPT-4V uses high-resolution tiling: large images are downscaled and divided into 512x512 tiles, each processed independently. This is why GPT-4V can read fine print in an image - it's examining tiled sub-images. You're also paying per tile, so a large image at high detail costs several times more than a small one. Downscale images to the minimum resolution needed for your task.
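The tiling arithmetic can be sketched as a cost estimator. The constants below (fit within 2048x2048, shortest side capped at 768, 85 base tokens plus 170 per tile) follow OpenAI's published vision pricing at the time of writing and may change - treat them as illustrative:

```python
import math

def estimate_high_detail_tokens(width: int, height: int) -> int:
    """Rough token estimate for a high-detail image, following the tiling
    scheme in OpenAI's vision pricing docs: fit within 2048x2048, scale the
    shortest side to at most 768, then bill 85 base tokens plus 170 per
    512x512 tile. Constants may change -- treat as illustrative."""
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(estimate_high_detail_tokens(640, 480))  # 2 tiles -> 425 tokens
```

Running this over your typical input sizes before you pick a `detail` setting makes the cost trade-off concrete.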
from openai import OpenAI
import base64
from PIL import Image
import io

client = OpenAI()

def analyze_image(image_path: str, prompt: str, detail: str = 'high') -> str:
    img = Image.open(image_path)
    if max(img.size) > 1024 and detail == 'auto':
        img.thumbnail((1024, 1024))  # Reduce cost
    buffer = io.BytesIO()
    img.save(buffer, format='PNG')
    img_b64 = base64.b64encode(buffer.getvalue()).decode()

    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{
            'role': 'user',
            'content': [
                {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{img_b64}', 'detail': detail}},
                {'type': 'text', 'text': prompt}
            ]
        }],
        max_tokens=1000
    )
    return response.choices[0].message.content
Document Understanding: The High-Value Enterprise Use Case
Document understanding - extracting structured data from PDFs, invoices, contracts, forms - is one of the highest-ROI multimodal applications in enterprise settings. Traditional approaches required custom OCR pipelines, rule-based extraction, and constant maintenance as document formats changed. Modern VLMs handle complex table extraction, merged cells, and rotated headers far better than text-only extraction.
invoice_schema = """
Extract the following fields as JSON:
- invoice_number
- vendor_name
- invoice_date
- line_items: list of {description, quantity, unit_price, total}
- total_amount
If a field is not present, use null. Return only valid JSON.
"""
import json

# Render the PDF page to an image first (e.g. with pdf2image) --
# analyze_image opens the file with PIL, which cannot read a PDF directly.
result = analyze_image('invoice_page1.png', invoice_schema)
structured_data = json.loads(result)
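In practice, "return only valid JSON" is honored most but not all of the time - models sometimes wrap the output in markdown fences anyway. A defensive parser (a sketch; the required field names match the invoice schema above) is more robust than a bare `json.loads`:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse JSON from a model response, tolerating markdown code fences.
    Defensive sketch: models sometimes wrap JSON in ```json fences despite
    'return only valid JSON' instructions."""
    match = re.search(r'```(?:json)?\s*(.*?)```', raw, re.DOTALL)
    if match:
        raw = match.group(1)
    data = json.loads(raw)
    # Validate that required fields exist (null values are allowed)
    required = {'invoice_number', 'vendor_name', 'invoice_date',
                'line_items', 'total_amount'}
    missing = required - data.keys()
    if missing:
        raise ValueError(f'missing fields: {sorted(missing)}')
    return data
```

Failing loudly on missing fields here is deliberate: a silent null propagating into an accounts-payable system is worse than a retry.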
At HCLTech, we used this approach to extract structured data from clinical trial protocols - 200-page PDFs with complex tables, nested eligibility criteria, and cross-references. The multimodal approach outperformed text-only extraction significantly on table handling.
Medical Imaging: The Opportunity and the Reality Check
General-purpose VLMs are not clinical-grade tools. GPT-4V on chest X-ray findings performs comparably to a first-year resident in some studies, but significantly underperforms for subtle findings or rare conditions. Never deploy a general-purpose VLM for clinical decision-making without extensive validation.
Specialized medical imaging models are far better for clinical tasks: Med-PaLM M (Google) and BioViL-T (Microsoft) dramatically outperform general VLMs on their target tasks.
The immediate safe opportunity is documentation and workflow, not diagnosis. Using a VLM to generate draft radiology report text for a radiologist to review is a legitimate, valuable workflow. The radiologist remains the decision-maker; AI is a documentation assistant.
Audio: The Underutilized Modality
Whisper (OpenAI) achieves near-human accuracy on transcription across 99 languages, handles domain vocabulary and accents remarkably well, and runs locally. At HCLTech, ambient clinical documentation - recording the doctor-patient encounter and auto-generating structured clinical notes - is one of the highest-ROI AI applications we've built.
import whisper

model = whisper.load_model('large-v3')
result = model.transcribe(
    'clinical_encounter.mp3',
    language='en',
    initial_prompt='Medical encounter between physician and patient. Use standard medical terminology.'
)
print(result['text'])
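Whisper also returns per-segment timestamps in `result['segments']`, which helps turn a raw transcript into something closer to an encounter note. A sketch that groups segments into paragraphs on long pauses - a crude proxy for speaker turns; the 2-second threshold is an assumption to tune:

```python
def format_transcript(result: dict, gap_threshold: float = 2.0) -> str:
    """Group Whisper segments into paragraphs, starting a new paragraph
    whenever the pause between segments exceeds gap_threshold seconds --
    a rough proxy for speaker turns in a two-party encounter."""
    paragraphs, current, last_end = [], [], 0.0
    for seg in result['segments']:
        if current and seg['start'] - last_end > gap_threshold:
            paragraphs.append(' '.join(current))
            current = []
        current.append(seg['text'].strip())
        last_end = seg['end']
    if current:
        paragraphs.append(' '.join(current))
    return '\n\n'.join(paragraphs)
```

For true speaker attribution you'd pair this with a diarization model; pause-based chunking is just a cheap first pass.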
Gemini 1.5 processes audio natively - understanding tone, pace, sentiment, and multiple speakers. This opens applications like customer call analysis, meeting understanding, and educational speaking feedback.
Video Understanding
Gemini 1.5 Pro can process up to 1 hour of video (frames sampled at 1fps) with its 1M context window. Applications: surgical procedure documentation, safety monitoring in manufacturing, training content extraction, sports performance analysis. Cost is significant at scale - sample at lower frame rates unless temporal resolution is critical.
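If you pre-sample frames yourself before upload, the index arithmetic is simple. A sketch (the function name and signature are illustrative, not any SDK's API):

```python
def sample_frame_indices(duration_s: float, native_fps: float,
                         sample_fps: float = 1.0) -> list[int]:
    """Frame indices to extract when downsampling a video to sample_fps.
    e.g. a 10s clip at 30fps sampled at 1fps -> frames 0, 30, 60, ..., 270.
    Illustrative helper -- pair with your decoder of choice (ffmpeg, OpenCV)."""
    step = native_fps / sample_fps
    total_frames = int(duration_s * native_fps)
    return [int(i * step) for i in range(int(total_frames / step))]
```

Halving the sample rate roughly halves the token cost, so pick the lowest rate that still captures the events you care about.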
Practical Considerations
- Image quality gates matter. Blurry, poorly lit, or incorrectly rotated images degrade performance severely. Implement quality checks before sending to the model.
- Multimodal models hallucinate about images. For precision-sensitive use cases, pair VLM extraction with validation - cross-check extracted numbers against other signals.
- Test with actual user artifact types. Camera phone photos have very different characteristics than scanner output. Your demo images are not your production images.
- Consider specialized models for specialized domains. A dermatology-focused model will outperform GPT-4V for dermatology classification.
Multimodal AI is where a lot of the most interesting product work is happening right now. The teams building competitive moats are the ones going beyond text - tackling the documents, images, audio, and video that previous AI generations simply couldn't touch.