ENEIDE: A High Quality Silver Standard Dataset for Named Entity Recognition and Linking in Historical Italian
Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Laura Melosi +1 more
Abstract
This paper introduces ENEIDE (Extracting Named Entities from Italian Digital Editions), a silver standard dataset for Named Entity Recognition and Linking (NERL) in historical Italian texts. The corpus comprises 2,111 documents with over 8,000 entity annotations semi-automatically extracted from two scholarly digital editions: Digital Zibaldone, the philosophical diary of the Italian poet Giacomo Leopardi (1798--1837...
Summary
ENEIDE (Extracting Named Entities from Italian Digital Editions) is introduced as a silver-standard dataset for Named Entity Recognition and Linking (NERL) in historical Italian. It contains 2,111 documents and over 8,000 entity annotations across multiple types, linked to Wikidata (including NIL entities) and organized in standard train/dev/test splits. The dataset is built via a semi-automatic extraction pipeline over curated scholarly digital editions, targeting historically oriented NERL research.
Key Contributions
- Introduces ENEIDE, a silver-standard NERL dataset for historical Italian, with 2,111 documents and over 8,000 entity annotations linked to Wikidata, including NIL entities and standard train/dev/test splits.
- Formalizes a semi-automatic annotation extraction pipeline from manually curated scholarly digital editions to build the NERL dataset.
- Provides benchmark recall results for models evaluated on ENEIDE, including Minerva-7B and Ministral-8B.
Reproducibility Notes
- Carefully reconstruct the annotation extraction pipeline and NERL evaluation from the text.
- Track all assumptions and pin software versions before running comparisons.
- Expect multi-day effort for setup and meaningful reproduction under current guidance.
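The advice to track assumptions and pin versions can be made concrete with a small experiment-log helper. The following is a minimal sketch using only the Python standard library. The function names, the example assumption strings, and the package names (`transformers`, `torch`) are illustrative assumptions, not details taken from the paper.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(packages):
    """Record the Python version and the installed version of each package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {"python": platform.python_version(), "packages": versions}


def make_log_entry(assumptions, packages):
    """Bundle free-text assumptions with an environment snapshot."""
    return {
        "assumptions": list(assumptions),
        "environment": snapshot_environment(packages),
    }


if __name__ == "__main__":
    # Hypothetical assumptions and package names, for illustration only.
    entry = make_log_entry(
        assumptions=[
            "Entity spans are matched exactly",
            "NIL entities are scored as a separate link target",
        ],
        packages=["transformers", "torch"],
    )
    print(json.dumps(entry, indent=2))
```

Writing one such entry per run makes it easier to tell later whether two results differ because of the model or because of the environment.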
Results & Benchmarks
| Dataset | Model | Metric | Value |
|---|---|---|---|
| ENEIDE | Minerva-7B | Recall | 0.052 |
| ENEIDE | Ministral-8B | Recall | 0.340 |
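For orientation, recall is the fraction of gold annotations that a system recovers. The sketch below shows one way to compute it for linked entities. It assumes exact matching on document, span offsets, and link target, with NIL treated as an ordinary target. This page does not state the paper's actual matching criterion, and the tuple layout and identifiers are toy placeholders.

```python
NIL = "NIL"  # placeholder link target for entities with no Wikidata entry


def recall(gold, predicted):
    """Share of gold annotations that also appear in the predictions.

    Each annotation is a hashable tuple; here (doc_id, start, end, link).
    Exact-match scoring is an assumption of this sketch.
    """
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(gold_set & set(predicted)) / len(gold_set)


# Toy data: identifiers are placeholders, not real Wikidata QIDs.
gold = [
    ("doc1", 0, 8, "Q_EXAMPLE_1"),
    ("doc1", 15, 20, NIL),
    ("doc2", 3, 9, "Q_EXAMPLE_2"),
    ("doc2", 30, 41, "Q_EXAMPLE_3"),
]
predicted = [
    ("doc1", 0, 8, "Q_EXAMPLE_1"),   # correct span and link
    ("doc1", 15, 20, NIL),           # correct NIL prediction
    ("doc2", 3, 9, "Q_EXAMPLE_9"),   # right span, wrong link: no credit
]

print(recall(gold, predicted))
```

Under these assumptions two of the four gold annotations are recovered. A looser criterion, such as overlapping spans or span-only matching, would give different numbers, so the criterion should be fixed before comparing against the reported values.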
Hardware Requirements
- Expect multi-day setup/compute for meaningful reproduction based on current guidance.
Best Implementation
No maintained implementation has been confirmed for this paper yet.
Use the Implementation Status and Reproduction Path sections below for the current action plan.
Reproduction Path
Follow this baseline workflow to decide if this paper is worth immediate prototyping.
1. Use the paper and benchmark evidence to scope a baseline reproduction plan.
2. Track assumptions and missing details in an experiment log before coding.
Additional Implementations
No additional verified repositories were found.
Hugging Face Artifacts
No trustworthy direct or curated related Hugging Face artifacts were found yet.