Semantic Novelty at Scale: Narrative Shape Taxonomy and Readership Prediction in 28,606 Books

W. Frederick Zimmerman · Feb 24, 2026 · Citations: 0

Abstract

I introduce semantic novelty--cosine distance between each paragraph's sentence embedding and the running centroid of all preceding paragraphs--as an information-theoretic measure of narrative structure at corpus scale. Applying it to 28,606 books in PG19 (pre-1920 English literature), I compute paragraph-level novelty curves using 768-dimensional SBERT embeddings, then reduce each to a 16-segment Piecewise Aggregate Approximation (PAA). Ward-linkage clustering on PAA vectors reveals eight canonical narrative shape archetypes, from Steep Descent (rapid convergence) to Steep Ascent (escalating unpredictability). Volume--variance of the novelty trajectory--is the strongest length-independent predictor of readership (partial rho = 0.32), followed by speed (rho = 0.19) and Terminal/Initial ratio (rho = 0.19). Circuitousness shows strong raw correlation (rho = 0.41) but is 93 percent correlated with length; after control, partial rho drops to 0.11--demonstrating that naive correlations in corpus studies can be dominated by length confounds. Genre strongly constrains narrative shape (chi squared = 2121.6, p < 10 to the power negative 242), with fiction maintaining plateau profiles while nonfiction front-loads information. Historical analysis shows books became progressively more predictable between 1840 and 1910 (T/I ratio trend r = negative 0.74, p = 0.037). SAX analysis reveals 85 percent signature uniqueness, suggesting each book traces a nearly unique path through semantic space. These findings demonstrate that information-density dynamics, distinct from sentiment or topic, constitute a fundamental dimension of narrative structure with measurable consequences for reader engagement. Dataset: https://huggingface.co/datasets/wfzimmerman/pg19-semantic-novelty

Human Data Lens

Uses human feedback: No
Feedback types: None
Rater population: Unknown
Unit of annotation: Trajectory
Expertise required: General

Evaluation Lens

Evaluation modes: Automatic Metrics
Agentic eval: Long Horizon
Quality controls: Not reported
Confidence: 0.40
Flags: ambiguous

Research Summary

Contribution Summary

I introduce semantic novelty--cosine distance between each paragraph's sentence embedding and the running centroid of all preceding paragraphs--as an information-theoretic measure of narrative structure at corpus scale.
Applying it to 28,606 books in PG19 (pre-1920 English literature), I compute paragraph-level novelty curves using 768-dimensional SBERT embeddings, then reduce each to a 16-segment Piecewise Aggregate Approximation (PAA).
Ward-linkage clustering on PAA vectors reveals eight canonical narrative shape archetypes, from Steep Descent (rapid convergence) to Steep Ascent (escalating unpredictability).

Semantic Novelty at Scale: Narrative Shape Taxonomy and Readership Prediction in 28,606 Books

Abstract

Human Data Lens

Evaluation Lens

Research Summary

Contribution Summary

Related Papers