Skip to content
← Back to explorer

Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

Gianluca Vico, Jindřich Libovický · Feb 16, 2026 · Citations: 0

Abstract

We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: Coding, Multilingual

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.30
  • Flags: low_signal, possible_false_positive

Research Summary

Contribution Summary

  • We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy.
  • The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual wo
  • We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation.

Why It Matters For Eval

  • We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation.

Related Papers