AI can speed up drafting, summarising, and evidence handling. It can also fail in predictable ways. In regulated or scientific contexts, even small errors can matter.

Why this page exists

AI is a capable collaborator for medical writing. It is not an infallible one. The failures below are the ones that recur in real projects — not hypothetical risks, but the kinds of errors that show up in drafts, summaries, and decks when AI output is accepted without verification. Recognising these patterns is the first step in catching them. Pairing AI with source checking and human judgement is the second.

The failure modes

1. Hallucinated citations

What happens: AI invents a plausible-looking reference (authors, journal, year, PMID) that does not exist.
Example: A draft cites “Patel et al., Lancet Oncol 2022;23(4):512–521” to support a survival claim. The paper is not indexed in PubMed. The DOI resolves to nothing.
Why it matters: A fabricated reference that survives into a manuscript, slide, or regulatory document undermines the credibility of every other citation in the deliverable.
How to catch it: Resolve every reference against PubMed, the publisher site, or a DOI. Do not trust a citation because it “looks right.”
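Reference resolution is also easy to script as a first pass. The snippet below is a minimal sketch, assuming Python with the requests library and the public Crossref and NCBI E-utilities endpoints; the identifiers are placeholders, and a failed lookup still needs a human check, since a typo in a genuine reference will also fail to resolve.

import requests

def doi_resolves(doi: str) -> bool:
    # Crossref returns 200 for registered DOIs and 404 for unknown ones.
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

def pmid_exists(pmid: str) -> bool:
    # NCBI E-utilities esummary flags unknown PMIDs with an "error" field.
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
        params={"db": "pubmed", "id": pmid, "retmode": "json"},
        timeout=10,
    )
    if resp.status_code != 200:
        return False
    record = resp.json().get("result", {}).get(pmid, {})
    return bool(record) and "error" not in record

# Placeholder identifiers taken from an AI-generated reference list.
for doi in ["10.0000/example-doi"]:
    print(doi, "resolves" if doi_resolves(doi) else "NOT FOUND, check manually")
for pmid in ["00000000"]:
    print(pmid, "in PubMed" if pmid_exists(pmid) else "NOT FOUND, check manually")

A script like this only confirms that an identifier exists; it cannot confirm that the paper says what the draft claims, so the source check that follows is still manual.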

2. Misquoted statistics

What happens: A numerical value is slightly wrong: a transposed digit, a wrong confidence interval, a p-value shifted by one decimal, a rounded sample size.
Example: Source: HR 0.72 (95% CI 0.61–0.85). AI output: HR 0.72 (95% CI 0.61–0.58).
Why it matters: Fluent prose makes numerical errors almost invisible on a read-through. Downstream documents inherit the wrong value.
How to catch it: Verify every number against the source table or figure. Do not rely on re-reading the AI output alone.
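Where the source values are already tabulated, part of this check can be scripted. The sketch below is an illustrative Python example only: it pulls a hazard ratio and confidence interval out of a draft sentence with a deliberately narrow regular expression and compares them with hand-verified values. The pattern and the numbers are assumptions for illustration, not a general-purpose parser.

import re

# Hand-verified values transcribed from the source table (illustrative only).
SOURCE = {"HR": "0.72", "CI_low": "0.61", "CI_high": "0.85"}

draft = "The hazard ratio was 0.72 (95% CI 0.61–0.58), favouring treatment."

# Deliberately narrow pattern for "x.xx (95% CI a.aa–b.bb)" style reporting.
match = re.search(r"(\d\.\d+)\s*\(95% CI\s*(\d\.\d+)\s*[-–]\s*(\d\.\d+)\)", draft)
if match:
    for label, found, expected in zip(
        ["HR", "CI lower", "CI upper"], match.groups(), SOURCE.values()
    ):
        status = "OK" if found == expected else "MISMATCH, check the source table"
        print(f"{label}: draft={found}, source={expected} -> {status}")

Run against the example above, this flags the upper confidence limit while the hazard ratio passes, which is exactly the kind of error a read-through tends to miss.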

3. Wrong trial arm attribution

What happens: A result is assigned to the wrong arm, comparator, subgroup, or cohort.
Example: A placebo-arm adverse event rate is described as the active treatment rate. A subgroup response is framed as the ITT result.
Why it matters: The claim may read as scientifically sound while reversing the direction of the evidence.
How to catch it: Cross-check attribution against the source table. Confirm the analysis population for every reported value.

4. Endpoint confusion

What happens: PFS is reported as OS. A secondary endpoint is presented as primary. A biomarker finding is framed as a clinical outcome.
Example: AI writes “the trial demonstrated an overall survival benefit” when the reported result was progression-free survival.
Why it matters: Endpoint confusion changes the clinical meaning of the finding, and in regulated materials it is a fair-balance and accuracy issue.
How to catch it: Confirm endpoint definitions from the protocol or publication. Check primary vs. secondary status before paraphrasing.

5. Overstated conclusions

What happens: Cautious source language is upgraded to something stronger. “Suggests potential benefit” becomes “demonstrates superiority.” “Numerically higher” becomes “significantly higher.”
Example: A non-significant trend (p=0.08) is summarised as “an improvement in response rate.”
Why it matters: The output still reads like scientific writing, but it no longer matches what the evidence supports.
How to catch it: Compare verbs and qualifiers line by line with the source. Watch for the quiet loss of hedging language.
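A blunt automated screen can flag some of this before human review. The sketch below is a simple Python illustration; the word lists are assumptions for the example, not a validated lexicon, and a clean result never replaces reading the output against the source.

# Illustrative word lists; a real workflow would use a curated lexicon.
HEDGES = {"suggest", "suggests", "may", "potential", "numerically", "trend"}
STRONG = {"demonstrate", "demonstrates", "significantly", "superiority", "proves"}

def words(text: str) -> set:
    return {w.strip(".,();").lower() for w in text.split()}

source = "The data suggest a potential benefit, with a numerically higher response rate."
summary = "The data demonstrate a significantly higher response rate."

lost_hedges = (HEDGES & words(source)) - words(summary)
added_strength = (STRONG & words(summary)) - words(source)

if lost_hedges or added_strength:
    print("Possible overstatement; compare against the source:")
    print("  hedging lost:", sorted(lost_hedges))
    print("  stronger language added:", sorted(added_strength))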

6. Data-claim mismatch

What happens: The cited evidence supports one thing; the generated claim describes another.
Example: The reference reports tumour response rate. The AI-written sentence cites the same paper to support a survival or quality-of-life claim.
Why it matters: The citation is real, but it does not support the claim. This is harder to spot than a fabricated reference because the reference exists.
How to catch it: For each claim, confirm that the cited source actually supports that specific statement, not just the general topic.

7. Source conflation

What happens: AI blends details from two or more papers or trials into a single clean-sounding summary.
Example: A summary describes “a Phase 3 trial in 842 patients showing a 14-month median PFS”, but the patient number comes from one trial and the PFS from another.
Why it matters: The resulting statement does not describe any real study. Every verification path leads somewhere partially correct, which makes the error easy to miss.
How to catch it: Require one source per claim. If a sentence combines facts from multiple sources, split it.
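The one-source-per-claim rule is easier to enforce when it is tracked as data rather than prose. The sketch below is a hypothetical Python structure, not a real claims-management tool; the claim texts and source identifiers are invented placeholders.

from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    sources: list  # DOIs or PMIDs the drafted sentence draws on

claims = [
    Claim("Median PFS was 14 months.", ["10.0000/trial-a"]),
    Claim("A Phase 3 trial in 842 patients showed a 14-month median PFS.",
          ["10.0000/trial-a", "10.0000/trial-b"]),  # blended from two trials
]

for claim in claims:
    if len(claim.sources) == 1:
        print("OK (single source):", claim.text)
    else:
        print("SPLIT NEEDED (multiple sources):", claim.text)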

8. Population drift

What happens: A finding in a specific subgroup or narrow population is rewritten as if it applies more broadly.
Example: A result seen in PD-L1–high patients is described as “patients with advanced NSCLC.” A second-line finding is framed as first-line evidence.
Why it matters: The claim generalises beyond the evidence, which is a scientific accuracy issue and, in promotional contexts, a compliance issue.
How to catch it: Confirm the analysis population and line of therapy for every claim. Flag any phrasing that widens the population.

9. Missing limitations

What happens: AI produces a tidy summary that drops caveats: small sample size, exploratory analysis, non-significance, immature data, indirect comparison, open-label design.
Example: A summary of an exploratory post-hoc subgroup analysis reads like a pre-specified primary finding.
Why it matters: Removing limitations changes how the evidence should be interpreted, even when every number is correct.
How to catch it: Check whether the source flags the analysis as exploratory, post-hoc, or limited. If it does, the summary must too.

10. Compliance-sensitive phrasing drift

What happens: Neutral scientific wording becomes promotional, absolute, unbalanced, or insufficiently qualified.
Example: “Was associated with improved outcomes in this study” becomes “improves outcomes.” Safety information shrinks while efficacy language expands.
Why it matters: In regulated materials, phrasing drift can turn a defensible scientific statement into a claim that fails fair-balance or promotional-code review.
How to catch it: Review output against the relevant code (ABPI, PhRMA, local equivalents). Watch for absolute verbs, superlatives, and reduced safety proportionality.

Summary table

A compact reference for reviewers and reviewers’ reviewers.
Failure mode | Typical risk | Common context | Best check
Hallucinated citations | Fabricated reference enters the deliverable | Drafting from prompts without supplied sources | Resolve every reference via PubMed, DOI, or publisher
Misquoted statistics | Wrong numerical value reads as fluent prose | Stats-to-narrative, results summaries | Verify each number against the source table
Wrong trial arm attribution | Result assigned to the wrong arm or population | Multi-arm trials, subgroup results | Cross-check attribution against the source table
Endpoint confusion | Secondary reported as primary; PFS as OS | Oncology summaries, congress content | Confirm endpoint definitions from protocol or publication
Overstated conclusions | Hedged source language upgraded to certainty | Discussion sections, key messages | Line-by-line verb and qualifier check vs. source
Data-claim mismatch | Real citation supporting a different claim | Manuscript drafting, claim libraries | Confirm the cited source supports this specific claim
Source conflation | Details blended from multiple studies | Cross-trial summaries, landscape reviews | One source per claim; split blended sentences
Population drift | Narrow finding framed as broad | Adaptation, repurposing, PLS | Confirm analysis population and line of therapy
Missing limitations | Caveats stripped from a tidy summary | Summaries, plain language, slide decks | Check for exploratory, post-hoc, or immature data flags
Compliance-sensitive phrasing drift | Scientific wording becomes promotional | Promotional review, MSL and brand materials | Review against applicable promotional code

What this does not mean

These failure modes are not a reason to avoid AI in medical writing. They are a reason to pair AI with verification. The same tools that introduce these errors also make it possible to draft faster, summarise more consistently, and handle larger volumes of evidence than was previously practical. The value does not come from generation alone. It comes from connecting evidence, claims, and verification — keeping the AI inside a workflow where the source is always the ground truth and a human owns the sign-off. Verification is not an optional extra. In evidence-based medical writing, it is part of the workflow.

Further reading

Some of these failure patterns reflect broader behaviours of large language models under stress or ambiguous prompts. If you are interested in how LLMs fail at a more fundamental level, I explored this in more detail in How to Break a Large Language Model — published in AI Advances.


Last reviewed: 15 April 2026