Splits! Flexible Sociocultural Linguistic Investigation at Scale
Eylon Caplan, Tania Chakraborty, Dan Goldwasser · Apr 6, 2025 · Citations: 0
How to use this paper page
Coverage: Recent
Use this page to decide whether the paper is strong enough to influence an eval design. It summarizes the abstract plus available structured metadata. If the signal is thin, use it as background context and compare it against stronger hub pages before making protocol choices.
Best use
Background context only
Metadata: Recent
Trust level
Low
Signals: Recent
What still needs checking
Extraction flags indicate low-signal or possible false-positive protocol mapping.
Signal confidence: 0.15
Abstract
Variation in language use, shaped by speakers' sociocultural background and specific context of use, offers a rich lens into cultural perspectives, values, and opinions. For example, Chinese students discuss "healthy eating" with words like "timing," "regularity," and "digestion," whereas Americans use vocabulary like "balancing food groups" and "avoiding fat and sugar," reflecting distinct cultural models of nutrition. The computational study of these Sociocultural Linguistic Phenomena (SLP) has traditionally been done in NLP via tailored analyses of specific groups or topics, requiring specialized data collection and experimental operationalization--a process not well-suited to quick hypothesis exploration and prototyping. To address this, we propose constructing a "sandbox" designed for systematic and flexible sociolinguistic research. Using our method, we construct a demographically/topically split Reddit dataset, Splits!, validated by self-identification and by replicating several known SLPs from existing literature. We showcase the sandbox's utility with a scalable, two-stage process that filters large collections of "potential" SLPs (PSLPs) to surface the most promising candidates for deeper, qualitative investigation.
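To make the idea of a two-stage filter concrete, here is a minimal sketch. The abstract does not say what the two stages are, so everything below is an assumption for illustration only and should not be read as the authors' method: the function names (`log_odds`, `stage_one`, `stage_two`), the frequency threshold, the smoothed log-odds score, and the toy posts are all hypothetical. The sketch treats each word as a candidate, drops words too rare to score, then ranks the rest by how differently two groups use them.

```python
from collections import Counter
import math


def log_odds(term, counts_a, counts_b, alpha=1.0):
    """Additively smoothed log-odds of `term` in group A versus group B."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    vocab = len(set(counts_a) | set(counts_b))
    p_a = (counts_a[term] + alpha) / (total_a + alpha * vocab)
    p_b = (counts_b[term] + alpha) / (total_b + alpha * vocab)
    return math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))


def stage_one(candidates, counts_a, counts_b, min_count=2):
    """Hypothetical cheap filter: drop terms too rare to score reliably."""
    return [t for t in candidates if counts_a[t] + counts_b[t] >= min_count]


def stage_two(candidates, counts_a, counts_b, top_k=3):
    """Hypothetical ranking step: keep the top_k terms by absolute log-odds."""
    scored = [(t, log_odds(t, counts_a, counts_b)) for t in candidates]
    scored.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return scored[:top_k]


# Toy, made-up posts loosely echoing the "healthy eating" example.
group_a_posts = [
    "eat at regular timing for digestion",
    "timing and regularity help digestion",
]
group_b_posts = [
    "balance food groups and avoid sugar",
    "avoid fat and sugar balance food groups",
]

counts_a = Counter(w for post in group_a_posts for w in post.split())
counts_b = Counter(w for post in group_b_posts for w in post.split())
candidates = sorted(set(counts_a) | set(counts_b))

survivors = stage_one(candidates, counts_a, counts_b)
shortlist = stage_two(survivors, counts_a, counts_b)

for term, score in shortlist:
    # Positive scores lean toward group A, negative toward group B.
    print(f"{term}\t{score:+.3f}")
```

The shortlist is meant to go to a human reader. A high score here only flags a term for the deeper qualitative look the abstract describes and is not a finding on its own.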