OpenTrain AI
No verified implementation yet

ENEIDE: A High Quality Silver Standard Dataset for Named Entity Recognition and Linking in Historical Italian

Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Laura Melosi, and 1 more

March 31, 2026
arXiv: 2603.29801
0 repos
~a few days to reproduce

Abstract

This paper introduces ENEIDE (Extracting Named Entities from Italian Digital Editions), a silver standard dataset for Named Entity Recognition and Linking (NERL) in historical Italian texts. The corpus comprises 2,111 documents with over 8,000 entity annotations semi-automatically extracted from two scholarly digital editions: Digital Zibaldone, the philosophical diary of the Italian poet Giacomo Leopardi (1798--1837...

Summary

ENEIDE (Extracting Named Entities from Italian Digital Editions) is introduced as a silver-standard dataset for Named Entity Recognition and Linking (NERL) in historical Italian. It contains 2,111 documents and over 8,000 entity annotations across multiple types, linked to Wikidata (including NIL entities) and organized in standard train/dev/test splits. The dataset is built via a semi-automatic extraction pipeline over curated scholarly digital editions, targeting historically oriented NERL research.

Key Contributions

  • Introduces ENEIDE, a silver-standard NERL dataset for historical Italian, with 2,111 documents and over 8,000 entity annotations linked to Wikidata (including NIL entities), organized in standard train/dev/test splits.
  • Formalizes a semi-automatic annotation extraction pipeline from manually curated scholarly digital editions to build the NERL dataset.
  • Provides benchmark recall results for models evaluated on ENEIDE, including Minerva-7B and Ministral-8B.
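
To make the annotation format described above concrete, here is a minimal sketch of how a single NERL annotation with a Wikidata link or a NIL marker might be represented. The field names, the type labels ("PER", "LOC"), and the ID used in the example are illustrative assumptions; this summary does not specify ENEIDE's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed marker for mentions that have no Wikidata entry (NIL entities).
NIL = "NIL"


@dataclass(frozen=True)
class EntityAnnotation:
    """One entity mention in one document (hypothetical schema)."""
    doc_id: str
    start: int                   # character offset, inclusive
    end: int                     # character offset, exclusive
    surface: str                 # mention text as it appears in the document
    entity_type: str             # assumed label set, e.g. "PER" or "LOC"
    wikidata_id: Optional[str]   # a QID string, or None for a NIL entity

    @property
    def link(self) -> str:
        """Return the Wikidata ID, or the NIL marker when unlinked."""
        return self.wikidata_id if self.wikidata_id is not None else NIL


def annotate(doc_id: str, text: str, start: int, end: int,
             entity_type: str, wikidata_id: Optional[str] = None) -> EntityAnnotation:
    """Build an annotation, taking the surface form from the text offsets."""
    if not 0 <= start < end <= len(text):
        raise ValueError("offsets fall outside the document text")
    return EntityAnnotation(doc_id, start, end, text[start:end],
                            entity_type, wikidata_id)


text = "Leopardi scrive a Recanati."
# "Q0" is a placeholder ID for illustration, not a real Wikidata link.
linked = annotate("doc-1", text, 0, 8, "PER", wikidata_id="Q0")
unlinked = annotate("doc-1", text, 18, 26, "LOC")  # treated as NIL
```

Keeping character offsets alongside the surface form lets a consumer check that an annotation still aligns with the document text after any preprocessing.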

Reproducibility Notes

  • Carefully reconstruct the annotation extraction pipeline and NERL evaluation from the text.
  • Track all assumptions and pin software versions before running comparisons.
  • Expect multi-day effort for setup and meaningful reproduction under current guidance.

Results & Benchmarks

Task                                         Model          Metric    Value
High Quality Silver Standard Dataset Named   Minerva-7B     Recall    0.052
High Quality Silver Standard Dataset Named   Ministral-8B   Recall    0.340
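
For reference, recall is the share of gold annotations that a model recovers. The sketch below computes it under an assumed exact-match criterion on (document ID, start, end, label); the matching rules actually used for the figures above are not stated in this summary, and the IDs in the example are placeholders.

```python
from typing import Iterable, Tuple

# (doc_id, start, end, label): label may be an entity type or a Wikidata ID / "NIL".
Mention = Tuple[str, int, int, str]


def recall(gold: Iterable[Mention], predicted: Iterable[Mention]) -> float:
    """Micro-averaged recall: matched gold mentions / all gold mentions.

    Assumes exact match on every field; partial overlaps count as misses.
    """
    gold_set = set(gold)
    if not gold_set:
        return 0.0  # convention chosen here for an empty gold set
    return len(gold_set & set(predicted)) / len(gold_set)


gold = [
    ("doc-1", 0, 8, "Q0"),      # placeholder IDs, not real Wikidata links
    ("doc-1", 18, 26, "Q00"),
    ("doc-2", 5, 10, "NIL"),
    ("doc-2", 20, 25, "Q000"),
]
predicted = [
    ("doc-1", 0, 8, "Q0"),      # correct
    ("doc-2", 5, 10, "NIL"),    # correct NIL prediction
    ("doc-2", 30, 35, "Q0"),    # spurious: lowers precision, not recall
]

score = recall(gold, predicted)  # 2 of 4 gold mentions recovered
```

Because recall ignores spurious predictions, it should be read alongside precision when comparing models.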

Hardware Requirements

  • Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Best Implementation

Maintained implementation evidence is not confirmed for this paper yet.

Use the Implementation Status and Reproduction Path sections below for the current action plan.

Reproduction Path

Follow this baseline workflow to decide if this paper is worth immediate prototyping.

  1. Use the paper and benchmark evidence to scope a baseline reproduction plan.

  2. Track assumptions and missing details in an experiment log before coding.

Time to first repro: a few days. Estimate is based on a paper-only reproduction flow.

Additional Implementations

No additional verified repositories beyond the primary recommendation.

Hugging Face Artifacts

No trustworthy direct or curated related Hugging Face artifacts were found yet.

Continue with targeted Hugging Face searches: