Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers

Tuhin Chakrabarty, Jane C. Ginsburg, Paramveer Dhillon · Oct 15, 2025 · Citations: 0

Automatic Metrics Law Pairwise Preference

How to use this page

Moderate trust

Use this for comparison and orientation, not as your only source.

Best use

Secondary protocol comparison source

What to verify

Read the full paper before copying any benchmark, metric, or protocol choices.

Evidence quality

Moderate

Derived from extracted protocol signals and abstract evidence.

Abstract

The use of copyrighted books for training AI has sparked lawsuits from authors concerned about AI generating derivative content. Yet whether these models can produce high-quality literary text emulating authors' voices remains unclear. We conducted a preregistered study comparing MFA-trained writers with three frontier models (ChatGPT, Claude, Gemini) writing up to 450-word excerpts emulating 50 award-winning authors' styles. In blind pairwise evaluations by 28 MFA-trained readers and 516 college-educated general readers, AI text from in-context prompting was strongly disfavored by MFA readers for stylistic fidelity (OR=0.16) and quality (OR=0.13), while general readers showed no fidelity preference (OR=1.06) but favored AI for quality (OR=1.82). Fine-tuning ChatGPT on authors' complete works reversed these results: MFA readers favored AI for fidelity (OR=8.16) and quality (OR=1.87), with general readers showing even stronger preference (fidelity OR=16.65; quality OR=5.42). Both groups preferred fine-tuned AI, but the writer-type × reader-type interaction remained significant (p=0.021 for fidelity; p<10^-4 for quality), indicating general readers favored AI by a wider margin. Effects are robust under cluster-robust inference and generalize across authors in heterogeneity analyses. Fine-tuned outputs were rarely flagged as AI-generated (3% vs. 97% for prompting) by leading detectors. Mediation analysis shows fine-tuning eliminates detectable AI quirks that penalize in-context outputs, altering the nexus between detectability and preference. While not accounting for effort to transform AI output into publishable prose, the median fine-tuning cost of $81 per author represents a 99.7% reduction versus typical writer compensation. Author-specific fine-tuning enables non-verbatim AI writing preferred over expert human writing, providing evidence relevant to copyright's fourth fair-use factor.
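
The odds ratios in the abstract are easier to compare as shares of pairwise choices. The sketch below is a reading aid, not the paper's analysis: it assumes each odds ratio can be read as the odds that the AI excerpt is picked over the human excerpt in a single comparison, so the implied share is OR / (1 + OR). The paper's fitted model may define its odds ratios differently. The function names `preference_share` and `implied_baseline` are hypothetical, and the second function simply back-solves the compensation figure implied by the quoted $81 cost and 99.7% reduction.

```python
def preference_share(odds_ratio: float) -> float:
    """Convert an odds ratio into an implied share of pairwise choices.

    Assumption (not confirmed by the source): the odds ratio is the odds of
    the AI excerpt being chosen over the human one, so share = OR / (1 + OR).
    """
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    return odds_ratio / (1.0 + odds_ratio)


def implied_baseline(cost: float, reduction: float) -> float:
    """Back-solve the baseline cost from a reduced cost and a fractional reduction.

    If cost = baseline * (1 - reduction), then baseline = cost / (1 - reduction).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return cost / (1.0 - reduction)


# Odds ratios quoted in the abstract, keyed by (condition, readers, outcome).
ABSTRACT_ODDS_RATIOS = {
    ("in-context", "MFA", "fidelity"): 0.16,
    ("in-context", "MFA", "quality"): 0.13,
    ("in-context", "general", "fidelity"): 1.06,
    ("in-context", "general", "quality"): 1.82,
    ("fine-tuned", "MFA", "fidelity"): 8.16,
    ("fine-tuned", "MFA", "quality"): 1.87,
    ("fine-tuned", "general", "fidelity"): 16.65,
    ("fine-tuned", "general", "quality"): 5.42,
}

if __name__ == "__main__":
    for (condition, readers, outcome), ratio in ABSTRACT_ODDS_RATIOS.items():
        share = preference_share(ratio)
        print(f"{condition:10s} {readers:8s} {outcome:9s} OR={ratio:5.2f} share={share:.1%}")
    # Baseline implied by the abstract's $81 median cost and 99.7% reduction.
    print(f"implied baseline: ${implied_baseline(81.0, 0.997):,.0f}")
```

Under this reading, OR=8.16 corresponds to the AI excerpt being chosen in roughly 89% of comparisons and OR=0.16 in roughly 14%. The back-solved baseline of about $27,000 per author is an inference from the two quoted figures, not a number stated in the abstract.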

Low-signal caution for protocol decisions

Use this page for context, then validate protocol choices against stronger HFEPX references before making implementation decisions.

The abstract does not clearly name benchmarks or metrics.

Should You Rely On This Paper?

This paper has useful evaluation signal, but protocol completeness is partial; pair it with related papers before deciding implementation strategy.

Use if you need

A secondary eval reference to pair with stronger protocol papers.

Main weakness

The abstract does not clearly name benchmarks or metrics.

Trust level

Moderate

Usefulness score

55/100 • Medium

Useful as a secondary reference; validate protocol details against neighboring papers.

Human Feedback Signal

Detected

Evaluation Signal

Detected

Usefulness for eval research

Moderate-confidence candidate

Extraction confidence 70%

If you are doing eval pipeline work, start here:

Human Eval Hub, LLM-as-Judge Hub, Pairwise Preference Hub, Tool-Use Eval Hub

What We Could Verify

These are the protocol signals we could actually recover from the available paper metadata. Use them to decide whether this paper is worth deeper reading.