Risk tier: Medium
~15 min per paper with AI, ~45 min without
Enhanced review with source cross-check required for every extracted value.
Source papers → AI data extraction → Value-by-value verification → Structured evidence table
Best for
- Building evidence tables for publications, HEOR submissions, or regulatory documents
- Extracting endpoints, outcomes, and safety data from multiple papers for cross-study comparison
- Creating competitor or landscape summaries from published trial data
- Preparing structured data summaries for slide decks, advisory boards, or internal briefings
- Supporting systematic literature review data extraction (non-regulatory SLRs)
Inputs
- Full text of the source paper(s) (PDF or pasted text, not abstracts alone)
- A defined extraction template specifying which data points to capture (study design, population, endpoints, results, safety)
- Context on the intended use (evidence table format, comparison framework, or specific data needs)
Steps
Define the extraction template
Decide which data points you need before starting. Common fields: study name/acronym, design, population (N, key criteria), primary endpoint, primary result, key secondary results, safety summary, and follow-up duration. Match the template to the downstream deliverable.
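One lightweight way to make the template explicit is to encode it as a structure that both the extraction prompt and the verification step can share. A minimal sketch in Python; the field names mirror the common fields listed above but are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StudyExtraction:
    """One row of the evidence table. Field names are illustrative."""
    study_name: str                      # e.g. trial acronym
    design: str                          # e.g. "Phase 3, randomised, double-blind"
    population_n: Optional[int] = None   # enrolled N
    key_criteria: str = "Not reported"
    primary_endpoint: str = "Not reported"
    primary_result: str = "Not reported"
    secondary_results: list = field(default_factory=list)
    safety_summary: str = "Not reported"
    follow_up: str = "Not reported"

# A blank row makes omissions visible: unfilled fields stay "Not reported"
row = StudyExtraction(study_name="EXAMPLE-1", design="Phase 3, randomised")
print(row.primary_endpoint)  # -> Not reported
```

Defaulting every optional field to "Not reported" means a field left untouched by the extraction is flagged rather than silently blank, which feeds directly into the completeness check later.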
Provide the full paper to the AI
Paste or upload the complete text. Do not extract from abstracts alone. Abstracts omit subgroup details, secondary endpoints, and safety data that the full paper contains.
Run the extraction
Use the prompt pattern below to extract structured data into your template format. For posters, use PosterLens first to convert visual content into structured text before extracting.
Verify every value against the source
Open the paper’s results tables and check every extracted number side by side. AI commonly transposes values between study arms, confuses ITT and per-protocol populations, or rounds numbers differently from the source. Every value must match exactly.
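Part of this side-by-side check can be mechanised: re-key the numbers from the source table yourself and diff them against the AI's extraction, so mismatches (including transposed arms) surface explicitly rather than relying on reading alone. A hedged sketch, assuming both sets of values are dicts keyed by (arm, endpoint) pairs:

```python
def diff_against_source(extracted: dict, source: dict) -> list:
    """Return (key, extracted_value, source_value) for every mismatch.
    Values must match exactly - no rounding tolerance, mirroring the
    'every value must match exactly' rule above."""
    mismatches = []
    for key, source_value in source.items():
        if extracted.get(key) != source_value:
            mismatches.append((key, extracted.get(key), source_value))
    return mismatches

# Illustrative values only: the AI has transposed the two arms
source = {("treatment", "ORR %"): 42.3, ("placebo", "ORR %"): 18.7}
extracted = {("treatment", "ORR %"): 18.7, ("placebo", "ORR %"): 42.3}
for key, got, want in diff_against_source(extracted, source):
    print(key, "extracted:", got, "source:", want)
```

This does not replace the human check, since the re-keyed source values are themselves typed by hand, but two independent transcriptions disagreeing is a far stronger signal than one person proofreading their own table.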
Check completeness
Confirm that all required fields are populated and that no key findings have been omitted. Safety data is frequently under-extracted. If the paper reports it, your extraction should capture it.
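The field-population half of this check can be spot-checked programmatically: list the required template fields and flag any that came back empty. A minimal sketch (the field names are assumptions matching the illustrative template, not a fixed schema):

```python
REQUIRED_FIELDS = [
    "study_name", "design", "population_n", "primary_endpoint",
    "primary_result", "safety_summary", "follow_up",
]

def missing_fields(row: dict) -> list:
    """Fields that are absent or empty in one extracted row.
    Deciding whether 'Not reported' is genuine (paper omits the data)
    or an extraction miss stays a human judgement call."""
    return [f for f in REQUIRED_FIELDS if not row.get(f)]

row = {"study_name": "EXAMPLE-1", "design": "Phase 3", "primary_endpoint": "OS"}
print(missing_fields(row))
```

Note this only catches structurally empty fields; a populated but under-extracted safety summary ("AEs were consistent with the known profile") still needs a human eye.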
Output
A structured data table or summary with one row per study, containing the specified data fields with exact values from the source papers. Every cell should be traceable to a specific location in the source text. The output uses consistent units, terminology, and formatting across all studies.
Prompt pattern
Why this works
AI extracts and structures data from dense clinical papers in minutes, consistently applying the same template across multiple studies. This handles the mechanical work of locating values in results sections, tables, and figures. The human writer verifies every value against the source, resolves ambiguities (which population? which analysis?), and makes the judgement calls about completeness and relevance that the extraction template cannot capture.
Common mistakes
Transposed values between study arms
AI assigns the treatment arm’s result to the placebo arm, or swaps the hazard ratio direction. This is the most common extraction error and the hardest to catch by reading alone. Verify each value against the specific table or figure in the source paper.
Mixing populations or analysis types
A paper reports both ITT and per-protocol results. AI extracts the per-protocol result but labels it as ITT. If the paper reports multiple analysis populations, confirm which one your extraction captures and label it explicitly.
Under-extracting safety data
AI extracts the efficacy endpoints in full detail but reduces safety to “adverse events were consistent with the known safety profile.” If the paper reports specific AE rates, grade ≥3 events, and discontinuation rates, your extraction should capture these numbers.
Inconsistent units across studies
One paper reports median OS in months; another reports it in weeks. AI may not flag this inconsistency. When extracting across multiple papers, standardise units or clearly label them to prevent miscomparison.
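When standardising rather than just labelling, normalising at extraction time prevents silent miscomparison. A sketch using a simple conversion table; the 4.345 weeks-per-month factor is an assumption, so pick one convention, document it, and apply it to every study:

```python
WEEKS_PER_MONTH = 4.345  # assumed convention - document whichever factor you use

def to_months(value: float, unit: str) -> float:
    """Normalise a reported duration (e.g. median OS) to months."""
    conversions = {
        "months": 1.0,
        "weeks": 1.0 / WEEKS_PER_MONTH,
        "days": 12.0 / 365.25,
    }
    if unit not in conversions:
        # Refuse to guess: an unconverted, clearly labelled value
        # is safer than a silently wrong one
        raise ValueError(f"Unknown unit: {unit!r}")
    return round(value * conversions[unit], 1)

print(to_months(60.0, "weeks"))  # -> 13.8
```

Raising on an unrecognised unit (rather than passing the value through) forces the inconsistency to surface during extraction instead of in the finished evidence table.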
Extracting from abstracts instead of full text
Abstracts are often written before final analysis and may contain rounded or preliminary values that differ from the full paper. Always extract from the full publication, not the abstract.
Tool stack
| Tool | Role |
|---|---|
| PosterLens | Extract structured content from scientific posters before data extraction |
| PubCrawl | Find source papers if starting from an indication or research question |
Review checklist
- Every numerical value matches the source paper exactly
- Values are attributed to the correct study arm and population
- The analysis population (ITT, mITT, PP) is correctly identified for each result
- Safety data is extracted with the same detail as efficacy data
- Subgroup and post-hoc results are clearly labelled
- Units are consistent across studies (or clearly labelled where they differ)
- All required template fields are populated
- “Not reported” is used where data is genuinely absent, not where extraction missed it
- The extraction is traceable to specific sections of the source paper
Next steps: Use extracted data to Extract Key Messages, build a Content Outline, or feed into a Manuscript, Regulatory Document, or Slide Deck. For table-to-text conversion, see Convert Stats to Narrative.