Clinical trial enrollment is broken. As of 2025, roughly 85% of trials fail to recruit enough patients on time, and the median time from referral to enrollment decision is 14 days. That delay is not primarily a technology problem - it is an information retrieval problem. The eligibility criteria for a single trial can run to 40+ pages of dense medical logic: inclusion criteria, exclusion criteria, required prior therapies, washout periods, lab value thresholds. Manually matching a patient's EHR against that document is cognitively exhausting for coordinators who do it for dozens of active trials at once.
When I joined the clinical trials team, we had six full-time coordinators spending roughly 60% of their time on pre-screening - reading trial protocols, pulling patient records, and making rough match/no-match decisions before a physician ever looked at the case. The bottleneck was not their effort; it was the sheer volume of unstructured text on both sides. Trial protocols are PDFs. Patient records are progress notes, discharge summaries, lab result PDFs, and imaging reports - all unstructured, all inconsistently formatted.
The Architecture We Built
We built a RAG pipeline with two corpus branches: one for trial protocols, one for patient records. For protocols, we chunked each document by eligibility criterion - not by page or paragraph. A single criterion (e.g., "Prior treatment with at least one platinum-based regimen") became one chunk. This domain-specific chunking turned out to be the single most impactful engineering decision we made. Early experiments using fixed 512-token chunks produced retrieval recall of 61%. Criterion-aligned chunks pushed that to 89%.
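To make the chunking strategy concrete, here is a minimal sketch of criterion-aligned splitting. It assumes the protocol text has already been extracted from PDF and that criteria appear as numbered items under "Inclusion Criteria" and "Exclusion Criteria" headings - a common layout, though real protocols vary and our production splitter handled more formats than this:

```python
import re

def chunk_by_criterion(protocol_text: str) -> list[dict]:
    """Split extracted protocol text into one chunk per eligibility criterion.

    Assumes numbered items under 'Inclusion Criteria' / 'Exclusion Criteria'
    headings; continuation lines are folded into the current criterion.
    """
    chunks = []
    section = None
    for line in protocol_text.splitlines():
        stripped = line.strip()
        if re.match(r"(?i)^inclusion criteria", stripped):
            section = "inclusion"
            continue
        if re.match(r"(?i)^exclusion criteria", stripped):
            section = "exclusion"
            continue
        # A numbered item like "3. Prior treatment with ..." starts a new chunk
        m = re.match(r"^\d+[.)]\s+(.*)", stripped)
        if section and m:
            chunks.append({"section": section, "criterion": m.group(1)})
        elif section and stripped and chunks:
            # Continuation line: append to the criterion in progress
            chunks[-1]["criterion"] += " " + stripped
    return chunks

protocol = """Inclusion Criteria
1. Prior treatment with at least one platinum-based regimen
2. ECOG performance status 0-1
Exclusion Criteria
1. Active CNS metastases
"""
print(chunk_by_criterion(protocol))
```

Each chunk carries its section label, so the downstream matcher knows whether satisfying the criterion qualifies or disqualifies the patient.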
For patient records, we used a two-stage extraction pipeline: first, a clinical NER model (fine-tuned on i2b2 2010 data) extracted structured entities - diagnoses, medications, procedures, lab values with dates. Then we embedded those structured extractions rather than raw note text. This was counterintuitive: we threw away the free-text context and worked only with extracted entities. But it dramatically reduced noise and improved embedding quality for the specific matching task. We evaluated three embedding models - a general-purpose model, a biomedical BERT variant, and a proprietary clinical embeddings model - and the biomedical BERT variant consistently outperformed the others on our held-out validation set, despite having fewer parameters than the general-purpose model.
What Did Not Work Initially
Our first instinct was to use a large frontier model to do the matching end-to-end: give it the trial protocol and the patient summary and ask for a match decision. On our synthetic test cases this looked great. On real patients it was a disaster. The model hallucinated lab values that were not in the record, misread negations ("patient denies prior cardiac surgery" became a positive cardiac surgery flag), and was wildly overconfident. We also had a latency problem: a 40-page protocol plus a 12-month patient record exceeded context windows and took 45 seconds per patient-trial pair. With 200 active trials and 50 new patients per week, that math does not work.
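The arithmetic behind "that math does not work" is worth making explicit. A quick back-of-envelope, assuming every new patient is pre-screened against every active trial:

```python
# Throughput of the end-to-end frontier-model approach at our volumes
pairs_per_week = 200 * 50             # active trials x new patients per week
seconds_per_pair = 45                 # observed latency per patient-trial pair
hours_per_week = pairs_per_week * seconds_per_pair / 3600
print(pairs_per_week, hours_per_week)
# -> 10000 125.0
```

Ten thousand pairs a week at 45 seconds each is 125 hours of pure inference time - before retries, before any human review.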
The RAG approach solved both problems. By retrieving only the top-k relevant criterion chunks and matching them against extracted patient entities, we reduced context per query to under 2,000 tokens and cut latency to under 3 seconds. More importantly, the model was now reasoning over structured facts rather than raw text, which eliminated most of the hallucination surface.
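The retrieval step itself is simple once both sides live in the same embedding space. A minimal sketch with toy three-dimensional vectors standing in for the biomedical BERT embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_criteria(patient_vec, criterion_chunks, k=5):
    """Return the k criterion chunks most similar to the patient's entity
    embedding; only these go to the matching model, keeping each query
    well under the 2,000-token budget.

    criterion_chunks is a list of (criterion_text, embedding) pairs.
    """
    scored = [(cosine(patient_vec, vec), text) for text, vec in criterion_chunks]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [text for _, text in scored[:k]]

chunks = [
    ("Prior platinum-based regimen", [0.9, 0.1, 0.0]),
    ("ECOG performance status 0-1", [0.1, 0.8, 0.1]),
    ("No active CNS metastases", [0.0, 0.2, 0.9]),
]
patient = [0.8, 0.2, 0.1]
print(top_k_criteria(patient, chunks, k=2))
# -> ['Prior platinum-based regimen', 'ECOG performance status 0-1']
```

In production the match decision for each retrieved criterion is still made by the language model - but over a short criterion chunk and a structured fact list, not a 40-page protocol and a year of raw notes.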
Results and What We Learned
After a 90-day pilot across two oncology protocols, average pre-screening turnaround dropped from 14 days to 3.8 days - a 73% reduction. The false negative rate (patients incorrectly ruled out) was 4.2%, compared to an estimated 11% baseline for manual pre-screening, based on a retrospective audit. Two trials met their enrollment targets ahead of schedule for the first time.
The lessons I took away: embedding quality matters more than model size for specialized retrieval tasks; domain-specific chunking is the highest-leverage engineering decision in a RAG pipeline; and structured extraction before embedding is worth the added pipeline complexity when your source documents are noisy clinical text. The framework we built is now in production across eight trials, and we are extending it to rare disease indications, where manual screening is even more burdensome.