
How much of an LLM-generated clinical corpus is actually new? A production-scale measurement of content redundancy for provenance classification

Ali H. Lazem, William J. Teahan · Jun 28, 2026 · Citations: 0

How to use this page

Moderate trust

Use this for comparison and orientation, not as your only source.

Best use

Secondary protocol comparison source

What to verify

Read the full paper before copying any benchmark, metric, or protocol choices.

Evidence quality

Moderate

Derived from extracted protocol signals and abstract evidence.

Abstract

Clinical machine learning increasingly relies on training corpora generated by large language models (LLMs) rather than annotated by clinicians, and such corpora are described and reused largely on the basis of their reported scale. We test whether volume reflects information content. Analysing the complete output of a multi-agent clinical extraction pipeline applied to 167,034 patient narratives, 2.51 billion generated tokens across the ten text-bearing channels of an eleven-channel pipeline, we introduce Provenance-based Redundancy Decomposition, a token-level classification of the entire output by source. Only 10.9% of the output is trainable-unique content while 79.4% is redundant; raw token count overstates information content by roughly ninefold. The redundancy arises through two distinct mechanisms, verbatim copying of source context into per-item fields, and duplication of generated text across records, of which only the former is losslessly removable. An independent, model-free analysis based on lossless compression confirms the redundancy, recovering the two mechanisms without reference to the provenance labels. One pipeline channel carries almost no redundancy, showing that the level of redundancy depends on how each channel is structured rather than being a fixed property of LLM extraction. Because uncorrected redundancy up-weights the longer, more complex presentations that generate the most items, it skews the token-level training distribution of the corpus, a property we measure directly. In a controlled downstream test, de-duplicating the corpus before adaptation improved a clinical encoder on external disease-recognition benchmarks at equal token budget, robustly across adaptation depths and replicated on a second benchmark, confirming that the redundancy carries a measurable cost beyond storage. The classification tool is released openly.
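
The abstract mentions a model-free check based on lossless compression but gives no procedure. The following is a minimal sketch of how such a check could look, not the authors' method. The choice of `zlib` and the function names `compression_ratio` and `joint_saving` are assumptions made for illustration. The first function targets copying inside one record; the second targets duplication across two records.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes after DEFLATE compression at the highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more internal repetition."""
    raw = len(text.encode("utf-8"))
    return compressed_size(text) / raw if raw else 1.0


def joint_saving(a: str, b: str) -> float:
    """Fraction of bytes saved by compressing a and b together rather than apart.

    Values near 0 suggest the two texts share little; values near 0.5 suggest
    one is close to a copy of the other. zlib only looks back 32 KiB, so this
    sketch is meaningful for short records only (an assumption of the sketch).
    """
    separate = compressed_size(a) + compressed_size(b)
    joint = compressed_size(a + b)
    return 1.0 - joint / separate


# Hypothetical usage: a record whose fields repeat the same source context
# should compress far better than the context on its own.
context = "Patient reports three days of productive cough and fever. "
record = " ".join(f"item_{i}: {context}" for i in range(20))
within_record_ratio = compression_ratio(record)
cross_record_saving = joint_saving(record, record)
```

A low `compression_ratio` for a single record points to source context copied into many fields, while a high `joint_saving` between records points to generated text duplicated across records.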

Low-signal caution for protocol decisions

Use this page for context, then validate protocol choices against stronger HFEPX references before making implementation decisions.

  • The abstract does not clearly name benchmarks or metrics.

Should You Rely On This Paper?

This paper carries a useful evaluation signal, but its protocol description is incomplete; pair it with related papers before deciding on an implementation strategy.

Use if you need

A secondary eval reference to pair with stronger protocol papers.

Main weakness

The abstract does not clearly name benchmarks or metrics.

Trust level

Moderate

Usefulness score

55/100 • Medium

Useful as a secondary reference; validate protocol details against neighboring papers.

Human Feedback Signal

Detected

Evaluation Signal

Detected

Usefulness for eval research

Moderate-confidence candidate

Extraction confidence: 70%

What We Could Verify

These are the protocol signals we could actually recover from the available paper metadata. Use them to decide whether this paper is worth deeper reading.

Human Feedback Types

strong

Expert Verification

Directly usable for protocol triage.

"Clinical machine learning increasingly relies on training corpora generated by large language models (LLMs) rather than annotated by clinicians, and such corpora are described and reused largely on the basis of their reported scale."

Evaluation Modes

strong

Automatic Metrics

Includes extracted eval setup.

Quality Controls

missing

Not reported

No explicit QC controls found.

Benchmarks / Datasets

missing

Not extracted

No benchmark anchors detected.

Reported Metrics

missing

Not extracted

No metric anchors detected.

Rater Population

strong

Domain Experts

Helpful for staffing comparability.

Human Feedback Details

  • Uses human feedback: Yes
  • Feedback types: Expert Verification
  • Rater population: Domain Experts
  • Expertise required: Medicine

Evaluation Details

  • Evaluation modes: Automatic Metrics
  • Agentic eval: Multi Agent
  • Quality controls: Not reported
  • Evidence quality: Moderate
  • Use this page as: Secondary protocol comparison source

Protocol And Measurement Signals

Benchmarks / Datasets

No benchmark or dataset names were extracted from the available abstract.

Reported Metrics

No metric terms were extracted from the available abstract.

Research Brief

Metadata summary

Clinical machine learning increasingly relies on training corpora generated by large language models (LLMs) rather than annotated by clinicians, and such corpora are described and reused largely on the basis of their reported scale.

Based on abstract + metadata only. Check the source paper before making high-confidence protocol decisions.

Key Takeaways

  • Clinical machine learning increasingly relies on training corpora generated by large language models (LLMs) rather than annotated by clinicians, and such corpora are described and reused largely on the basis of their reported scale.
  • We test whether volume reflects information content.
  • Analysing the complete output of a multi-agent clinical extraction pipeline applied to 167,034 patient narratives, 2.51 billion generated tokens across the ten text-bearing channels of an eleven-channel pipeline, we introduce Provenance-based Redundancy Decomposition, a token-level classification of the entire output by source.

Researcher Actions

  • Compare this paper against nearby papers in the same arXiv category before using it for protocol decisions.
  • Check the full text for explicit evaluation design choices (raters, protocol, and metrics).
  • Use related-paper links to find stronger protocol-specific references.

Caveats

  • Generated from abstract + metadata only; no PDF parsing.
  • Signals below are heuristic and may miss details reported outside the abstract.

Research Summary

Contribution Summary

  • Analysing the complete output of a multi-agent clinical extraction pipeline applied to 167,034 patient narratives, 2.51 billion generated tokens across the ten text-bearing channels of an eleven-channel pipeline, we introduce…
  • Only 10.9% of the output is trainable-unique content while 79.4% is redundant; raw token count overstates information content by roughly ninefold.
  • In a controlled downstream test, de-duplicating the corpus before adaptation improved a clinical encoder on external disease-recognition benchmarks at equal token budget, robustly across adaptation depths and replicated on a second…
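
The "roughly ninefold" figure in the second bullet follows from the reported 10.9% unique share. The short sketch below reproduces that arithmetic. The function name `overstatement_factor` is hypothetical, and applying the 10.9% share to the 2.51 billion token total is an illustrative assumption, since the abstract does not state the denominator explicitly.

```python
def overstatement_factor(unique_fraction: float) -> float:
    """How many raw tokens exist per token of unique content."""
    if not 0.0 < unique_fraction <= 1.0:
        raise ValueError("unique_fraction must be in (0, 1]")
    return 1.0 / unique_fraction


UNIQUE = 0.109      # trainable-unique share reported in the abstract
REDUNDANT = 0.794   # redundant share reported in the abstract
TOTAL_TOKENS = 2.51e9

factor = overstatement_factor(UNIQUE)          # about 9.17, i.e. roughly ninefold
unaccounted = 1.0 - UNIQUE - REDUNDANT         # about 0.097
unique_tokens = TOTAL_TOKENS * UNIQUE          # about 2.7e8 under the stated assumption
```

The two reported shares sum to 90.3%; the abstract does not say how the remaining 9.7% is classified, so check the full paper before reusing these numbers.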

Why It Matters For Eval

  • Analysing the complete output of a multi-agent clinical extraction pipeline applied to 167,034 patient narratives, 2.51 billion generated tokens across the ten text-bearing channels of an eleven-channel pipeline, we introduce…
  • In a controlled downstream test, de-duplicating the corpus before adaptation improved a clinical encoder on external disease-recognition benchmarks at equal token budget, robustly across adaptation depths and replicated on a second…
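
The downstream test compares corpora at an equal token budget, with and without de-duplication. The sketch below illustrates that idea with exact-match de-duplication. It is not the authors' released tool: the names `dedup_exact` and `take_budget`, the whitespace tokenisation, and the case-folding normalisation are all assumptions.

```python
import hashlib
from typing import Iterable, List


def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return " ".join(text.split()).lower()


def dedup_exact(records: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each record, compared after normalisation."""
    seen = set()
    kept = []
    for record in records:
        key = hashlib.sha256(normalise(record).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


def take_budget(records: Iterable[str], budget: int) -> List[str]:
    """Take records in order until the next one would exceed the token budget.

    Tokens are counted by whitespace splitting, a stand-in for a real tokeniser.
    """
    taken = []
    used = 0
    for record in records:
        n_tokens = len(record.split())
        if used + n_tokens > budget:
            break
        taken.append(record)
        used += n_tokens
    return taken


# Hypothetical usage with toy records.
records = ["chest pain on exertion", "Chest  pain on exertion", "fever and cough"]
raw_sample = take_budget(records, budget=7)
dedup_sample = take_budget(dedup_exact(records), budget=7)
```

With the same budget of seven tokens, the raw stream is exhausted by a repeated record, while the de-duplicated stream fits two distinct ones. Exact matching only addresses duplication across records; copied spans inside a record would need a span-level method.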

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Expert Verification

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Gap: Quality control reporting is present

    No calibration/adjudication/IAA control explicitly detected.

  • Gap: Benchmark or dataset anchors are present

    No benchmark/dataset anchor extracted from abstract.

  • Gap: Metric reporting is present

    No metric terms extracted.
