AI can speed up drafting, summarising, and evidence handling. It can also fail in predictable ways. In regulated or scientific contexts, even small errors can matter.
Why this page exists
AI is a capable collaborator for medical writing. It is not an infallible one. The failures below are the ones that recur in real projects: not hypothetical risks, but the kinds of errors that show up in drafts, summaries, and decks when AI output is accepted without verification. Recognising these patterns is the first step in catching them. Pairing AI with source checking and human judgement is the second.

The failure modes
1. Hallucinated citations
What happens: AI invents a plausible-looking reference (authors, journal, year, PMID) that does not exist.

Example: A draft cites “Patel et al., Lancet Oncol 2022;23(4):512–521” to support a survival claim. The paper is not indexed in PubMed. The DOI resolves to nothing.

Why it matters: A fabricated reference that survives into a manuscript, slide, or regulatory document undermines the credibility of every other citation in the deliverable.

How to catch it: Resolve every reference against PubMed, the publisher site, or a DOI. Do not trust the citation because it “looks right.”
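Part of this check can be pre-screened in a few lines of code. The sketch below is illustrative only: it assumes the standard `doi.org` resolver behaviour (a redirect for registered DOIs, an HTTP error otherwise), and passing it does not prove the citation supports the claim; a human still confirms the record in PubMed or on the publisher site.

```python
import re
import urllib.error
import urllib.request

# Registered DOIs start with "10.", a numeric registrant prefix, a slash,
# and a suffix. This is a cheap syntactic pre-screen only.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(doi: str) -> bool:
    # True if the string is even shaped like a DOI. A well-formed DOI
    # can still be fabricated, so this never replaces resolution.
    return bool(DOI_PATTERN.match(doi.strip()))

def doi_resolves(doi: str, timeout: float = 10.0) -> bool:
    # doi.org redirects registered DOIs to the publisher and returns
    # 404 for unregistered ones. Network call: run manually or with
    # retries in practice; some publishers reject HEAD requests.
    req = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except urllib.error.HTTPError:
        return False
```

The DOI strings below are placeholders, not real citations; run `doi_resolves` only on references actually present in the draft.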
2. Misquoted statistics
What happens: A numerical value is slightly wrong: a transposed digit, a wrong confidence interval, a p-value shifted by one decimal, a rounded sample size.

Example: Source: HR 0.72 (95% CI 0.61–0.85). AI output: HR 0.72 (95% CI 0.61–0.58).

Why it matters: Fluent prose makes numerical errors almost invisible on a read-through. Downstream documents inherit the wrong value.

How to catch it: Verify every number against the source table or figure. Do not rely on re-reading the AI output alone.
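A coarse screen for this check can be scripted: extract every numeric token from the source passage and the AI output, and flag any token the output contains that the source does not. This is a minimal sketch, not a verification tool; it compares literal tokens, so it will miss a correct number attached to the wrong quantity, and a clean result still requires the manual table check.

```python
import re

def extract_numbers(text: str) -> list[str]:
    # Pull integer and decimal tokens as strings, so "0.72" and "95"
    # are compared literally rather than after any rounding.
    return re.findall(r"\d+(?:\.\d+)?", text)

def unmatched_numbers(source: str, output: str) -> list[str]:
    # Numeric tokens in the AI output that never appear in the source
    # passage. Each hit is a candidate misquote for manual review.
    src = set(extract_numbers(source))
    return [n for n in extract_numbers(output) if n not in src]

# Hypothetical values echoing the HR/CI example above:
source = "HR 0.72 (95% CI 0.61-0.85)"
output = "HR 0.72 (95% CI 0.61-0.58)"
print(unmatched_numbers(source, output))  # -> ['0.58']
```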
3. Wrong trial arm attribution
What happens: A result is assigned to the wrong arm, comparator, subgroup, or cohort.

Example: A placebo-arm adverse event rate is described as the active treatment rate. A subgroup response is framed as the ITT result.

Why it matters: The claim may read as scientifically sound while reversing the direction of the evidence.

How to catch it: Cross-check attribution against the source table. Confirm the analysis population for every reported value.
4. Endpoint confusion
What happens: PFS is reported as OS. A secondary endpoint is presented as primary. A biomarker finding is framed as a clinical outcome.

Example: AI writes “the trial demonstrated an overall survival benefit” when the reported result was progression-free survival.

Why it matters: Endpoint confusion changes the clinical meaning of the finding, and in regulated materials it is a fair-balance and accuracy issue.

How to catch it: Confirm endpoint definitions from the protocol or publication. Check primary vs. secondary status before paraphrasing.
5. Overstated conclusions
What happens: Cautious source language is upgraded to something stronger. “Suggests potential benefit” becomes “demonstrates superiority.” “Numerically higher” becomes “significantly higher.”

Example: A non-significant trend (p=0.08) is summarised as “an improvement in response rate.”

Why it matters: The output still reads like scientific writing, but it no longer matches what the evidence supports.

How to catch it: Compare verbs and qualifiers line by line with the source. Watch for the quiet loss of hedging language.
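The loss of hedging language is one of the few overstatement patterns that is mechanically detectable. The sketch below uses a small illustrative hedge list (an assumption, not a standard; build yours from your own source corpus and style guide) and reports hedges present in the source but absent from the output.

```python
import re

# Illustrative hedge list only. Tune to your corpus and house style.
HEDGES = {"suggests", "may", "might", "potential", "numerically",
          "trend", "appeared", "exploratory"}

def lost_hedges(source: str, output: str) -> set[str]:
    # Hedging terms the source uses that the output has dropped.
    # Each loss is a candidate overstatement for manual review,
    # not proof of one.
    words = lambda t: set(re.findall(r"[a-z]+", t.lower()))
    return (words(source) & HEDGES) - words(output)
```

A non-empty result is a prompt to reread that sentence pair side by side, not an automatic rejection.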
6. Data-claim mismatch
What happens: The cited evidence supports one thing; the generated claim describes another.

Example: The reference reports tumour response rate. The AI-written sentence references the same paper to support a survival or quality-of-life claim.

Why it matters: The citation is real, but it does not support the claim. This is harder to spot than a fabricated reference because the reference exists.

How to catch it: For each claim, confirm that the cited source actually supports that specific statement, not just the general topic.
7. Source conflation
What happens: AI blends details from two or more papers or trials into a single clean-sounding summary.

Example: A summary describes “a Phase 3 trial in 842 patients showing a 14-month median PFS”, but the patient number comes from one trial and the PFS from another.

Why it matters: The resulting statement does not describe any real study. Every verification path leads somewhere partially correct, which makes the error easy to miss.

How to catch it: Require one source per claim. If a sentence combines facts from multiple sources, split it.
8. Population drift
What happens: A finding in a specific subgroup or narrow population is rewritten as if it applies more broadly.

Example: A result seen in PD-L1–high patients is described as “patients with advanced NSCLC.” A second-line finding is framed as first-line evidence.

Why it matters: The claim generalises beyond the evidence, which is a scientific accuracy issue and, in promotional contexts, a compliance issue.

How to catch it: Confirm the analysis population and line of therapy for every claim. Flag any phrasing that widens the population.
9. Missing limitations
What happens: AI produces a tidy summary that drops caveats: small sample size, exploratory analysis, non-significance, immature data, indirect comparison, open-label design.

Example: A summary of an exploratory post-hoc subgroup analysis reads like a pre-specified primary finding.

Why it matters: Removing limitations changes how the evidence should be interpreted, even when every number is correct.

How to catch it: Check whether the source flags the analysis as exploratory, post-hoc, or limited. If it does, the summary must too.
10. Compliance-sensitive phrasing drift
What happens: Neutral scientific wording becomes promotional, absolute, unbalanced, or insufficiently qualified.

Example: “Was associated with improved outcomes in this study” becomes “improves outcomes.” Safety information shrinks while efficacy language expands.

Why it matters: In regulated materials, phrasing drift can turn a defensible scientific statement into a claim that fails fair-balance or promotional-code review.

How to catch it: Review output against the relevant code (ABPI, PhRMA, local equivalents). Watch for absolute verbs, superlatives, and reduced safety proportionality.
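Some of this drift can be surfaced mechanically before human review. The watch-list below is hypothetical and deliberately short; it does not encode any promotional code, it only flags candidates for the ABPI/PhRMA-style review the text describes.

```python
import re

# Hypothetical watch-list of absolute verbs and superlatives.
# Tune it to the promotional code that actually applies.
ABSOLUTE_TERMS = [r"\bimproves\b", r"\bcures\b", r"\bprevents\b",
                  r"\bproven\b", r"\bbest\b", r"\bsafest\b",
                  r"\bmost effective\b"]

def flag_absolute_language(text: str) -> list[str]:
    # Return every watch-list hit, lowercased, in pattern order.
    # A hit is a review prompt, not an automatic violation.
    hits = []
    for pattern in ABSOLUTE_TERMS:
        hits.extend(m.group(0).lower()
                    for m in re.finditer(pattern, text, re.IGNORECASE))
    return hits
```

Pair the flag output with the safety-proportionality check, which stays a human judgement: no word list can measure whether safety content shrank relative to efficacy content.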
Summary table
A compact reference for reviewers and reviewers’ reviewers.

| Failure mode | Typical risk | Common context | Best check |
|---|---|---|---|
| Hallucinated citations | Fabricated reference enters the deliverable | Drafting from prompts without supplied sources | Resolve every reference via PubMed, DOI, or publisher |
| Misquoted statistics | Wrong numerical value reads as fluent prose | Stats-to-narrative, results summaries | Verify each number against the source table |
| Wrong trial arm attribution | Result assigned to the wrong arm or population | Multi-arm trials, subgroup results | Cross-check attribution against the source table |
| Endpoint confusion | Secondary reported as primary; PFS as OS | Oncology summaries, congress content | Confirm endpoint definitions from protocol or publication |
| Overstated conclusions | Hedged source language upgraded to certainty | Discussion sections, key messages | Line-by-line verb and qualifier check vs. source |
| Data-claim mismatch | Real citation supporting a different claim | Manuscript drafting, claim libraries | Confirm the cited source supports this specific claim |
| Source conflation | Details blended from multiple studies | Cross-trial summaries, landscape reviews | One source per claim; split blended sentences |
| Population drift | Narrow finding framed as broad | Adaptation, repurposing, PLS | Confirm analysis population and line of therapy |
| Missing limitations | Caveats stripped from a tidy summary | Summaries, plain language, slide decks | Check for exploratory, post-hoc, or immature data flags |
| Compliance-sensitive phrasing drift | Scientific wording becomes promotional | Promotional review, MSL and brand materials | Review against applicable promotional code |
What this does not mean
These failure modes are not a reason to avoid AI in medical writing. They are a reason to pair AI with verification. The same tools that introduce these errors also make it possible to draft faster, summarise more consistently, and handle larger volumes of evidence than was previously practical. The value does not come from generation alone. It comes from connecting evidence, claims, and verification: keeping the AI inside a workflow where the source is always the ground truth and a human owns the sign-off. Verification is not an optional extra. In evidence-based medical writing, it is part of the workflow.

Further reading
Some of these failure patterns reflect broader behaviours of large language models under stress or ambiguous prompts. If you are interested in how LLMs fail at a more fundamental level, I explored this in more detail in How to Break a Large Language Model — published in AI Advances.
Related principles
- Source Grounding — every claim traces to a cited source
- Understanding AI Risk — why risk varies across tasks
- AI Risk Framework — tier-based review expectations
- Review and Accountability — sign-off protocols and audit trails
Related workflows
- Verify Claims Against References — claim-by-claim verification
- Check Document Consistency — catching drift across a document
- Check Promotional Compliance — phrasing and fair-balance review
- Final Human Review — the human sign-off step
Last reviewed: 15 April 2026 · 8 min read