SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction

Q: How reproducible is "SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction"?

Estimated time to first reproduction: a few days. Risk flags: (1) no repository-level reproducibility signals are currently available; (2) the estimate assumes artifact-level reproduction, and full training reproduction may require additional paper details. Use the paper-linked Hugging Face release as the starting artifact, then reconstruct training and evaluation settings from the paper.

David Števaňák, Marek Šuppa

Published: Mar 16, 2026

No direct implementation yet

Evidence: Inferred

Domain fit: AI-core

Verified repos: 0

Core AI workload signals detected from paper context and implementation/artifact evidence.

Time to first repro: a few days

2 risk flags


Keyphrase extraction for morphologically rich, low-resource languages remains understudied, largely due to the scarcity of suitable evaluation datasets. We address this gap for Slovak by constructing a dataset of 227,432 scientific abstracts with author-assigned keyphrases -- scraped and systematically cleaned from the Slovak Central Register of Theses -- representing a 25-fold increase over the largest prior Slovak resource and approaching the scale of established English benchmarks such as KP20K. Using this dataset, we benchmark three unsupervised baselines (YAKE, TextRank, KeyBERT with SlovakBERT embeddings) and evaluate KeyLLM, an LLM-based extraction method using GPT-3.5-turbo. Unsupervised baselines achieve at most 11.6% exact-match F1@6, with a large gap to partial matching (up to 51.5%), reflecting the difficulty of matching inflected surface forms to author-assigned keyphrases. KeyLLM narrows this exact-partial gap, producing keyphrases closer to the canonical forms assigned by authors, while manual evaluation on 100 documents (κ = 0.61) confirms that KeyLLM captures relevant concepts that automated exact matching underestimates. Our analysis identifies morphological mismatch as the dominant failure mode for statistical methods -- a finding relevant to other inflected languages. The dataset (https://huggingface.co/datasets/NaiveNeuron/SlovKE) and evaluation code (https://github.com/NaiveNeuron/SlovKE) are publicly available.
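The abstract reports an agreement value of κ = 0.61 for the manual evaluation but does not say which statistic it is. As a minimal sketch, assuming κ denotes Cohen's kappa between two annotators giving one label per item, the computation looks roughly like this. The function name `cohen_kappa` and the toy labels are illustrative only and are not taken from the paper or its code.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy relevance judgments (1 = relevant, 0 = not relevant); made-up data.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
```

On this toy data the annotators agree on 8 of 10 items, and the chance-corrected value is lower than the raw agreement, which is the reason for reporting kappa rather than percent agreement.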

Technical details

Canonical key: arxiv-2603.15523

Cache status: Fresh

Generated at: May 29, 2026, 3:15 AM

Artifact coverage: direct

HF provider: ok (token)

PWC source used: No

LLM status: ready

LLM model: openai/gpt-5.1-20251113

LLM generated: May 24, 2026, 5:41 AM

LLM content type: researcher_benchmark_brief

HF policy: hf-relevance-v27

LLM evidence refs: paper.abstract, evidencePack.paperSections[id=paper_caption_3], evidencePack.paperSections[id=paper_table_1], evidencePack.paperSections[id=paper_caption_5], evidencePack.paperSections[id=paper_table_2], evidencePack.paperSections[id=paper_19], evidencePack.paperSections[id=paper_16], evidencePack.paperSections[id=paper_table_3], researcherSummary.benchmarkSnapshot[0], researcherSummary.benchmarkSnapshot[1], paper.title, summary.hasReliableImplementation

implementation starting point

Benchmarks: thin evidence


Results & Benchmarks

Freshness tier: hot

Direct + Inferred Evidence

On SlovKE, unsupervised baselines reach at most 11.6% exact-match F1@6 but up to ...

Dataset: SlovKE

Value: 11.6

Source: LLM-grounded

KeyLLM achieves an exact-match F1@6 of approximately 15.2 on SlovKE, substantial ...

Dataset: SlovKE

Source: LLM-grounded

Benchmark evidence drill-down

2 findings

Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.

Finding	Dataset	Metric	Value	Source	Evidence refs
On SlovKE, unsupervised baselines reach at most 11.6% exact-match F1@6 but up to ...	SlovKE	F1@6 (exact match)	11.6	LLM-grounded	No explicit refs
KeyLLM achieves an exact-match F1@6 of approximately 15.2 on SlovKE, substantial ...	SlovKE	F1@6 (exact match)	15.2 (approx.)	LLM-grounded	No explicit refs
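To audit the F1@6 figures above, it helps to have a reference computation at hand. The following is a minimal sketch of F1@k under exact and partial matching. The partial-match rule used here (two phrases share at least one whole token) is an assumption, since this page does not give the paper's definition, and the helper names and Slovak example phrases are made up for illustration.

```python
from typing import List


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace; no stemming or lemmatization."""
    return " ".join(phrase.lower().split())


def is_match(pred: str, gold: str, partial: bool) -> bool:
    p, g = normalize(pred), normalize(gold)
    if not partial:
        return p == g
    # ASSUMED partial rule: the phrases share at least one whole token.
    return bool(set(p.split()) & set(g.split()))


def f1_at_k(predicted: List[str], gold: List[str], k: int = 6,
            partial: bool = False) -> float:
    """F1 over the top-k predicted keyphrases against the gold keyphrases."""
    top_k = predicted[:k]
    if not top_k or not gold:
        return 0.0
    # Precision counts predictions that match any gold phrase;
    # recall counts gold phrases matched by any prediction.
    matched_pred = sum(any(is_match(p, g, partial) for g in gold) for p in top_k)
    matched_gold = sum(any(is_match(p, g, partial) for p in top_k) for g in gold)
    precision = matched_pred / len(top_k)
    recall = matched_gold / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative phrases only; not drawn from the SlovKE dataset.
gold = ["neurónové siete", "strojové učenie"]
pred = ["neurónové siete", "hlboké strojové učenie", "model"]
exact_f1 = f1_at_k(pred, gold, k=6)
partial_f1 = f1_at_k(pred, gold, k=6, partial=True)
```

In this toy case the partial score is twice the exact score because one prediction contains a gold phrase without being identical to it. Note that this token-overlap rule would not credit inflected variants of a gold phrase, so reproducing the reported partial-match numbers requires the paper's actual matching rule.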


Implementation Evidence Summary

Confidence: low

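No verified, maintained repository implementation was found. Paper-linked artifacts are available: the Hugging Face dataset (https://huggingface.co/datasets/NaiveNeuron/SlovKE) and, per the abstract, evaluation code at https://github.com/NaiveNeuron/SlovKE, which is not yet counted among the verified repos.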

Reproduction Risks

Estimate assumes artifact-level reproduction; full training reproduction may require additional paper details.

Hardware Notes

Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Evidence disclosure

Evidence graph: 2 refs, 1 link.

Utility signals: depth 95/100, grounding 68/100, status medium.

Implementation Comparison

Top 1 path

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

eerstar/LLM-Agent-paper-daily

alternative

Maintenance: Active

Confidence: Low

Reproducibility: Moderate

Matched via arXiv identifier search

Stars: 0
Last push: May 25, 2026 (4d ago)

CI, Dependencies

Risk flags

No tagged releases
No Docker setup
Low confidence match

Implementation Status

No verified maintained repo

There is no verified maintained implementation yet. Use this baseline plan to decide whether to prototype now or defer.

Use the paper-linked Hugging Face release as the starting artifact, then reconstruct training and evaluation settings from the paper.
No direct maintained implementation was found. Use the paper PDF and citation graph to design a baseline reproduction.
Track assumptions and missing details in an experiment log before coding.
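One lightweight way to keep the experiment log suggested above is a small structured record that is written to disk before any code runs. This is only a sketch: the class names, fields, and the example assumptions are hypothetical and are not part of the paper or its released code.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Assumption:
    topic: str                          # what is underspecified
    assumed: str                        # the value or rule chosen for now
    source: str = "not stated in paper" # where the choice came from
    resolved: bool = False              # flip once confirmed against the paper/code


@dataclass
class ExperimentLog:
    paper: str
    artifact: str
    assumptions: List[Assumption] = field(default_factory=list)

    def add(self, topic: str, assumed: str,
            source: str = "not stated in paper") -> Assumption:
        entry = Assumption(topic, assumed, source)
        self.assumptions.append(entry)
        return entry

    def open_items(self) -> List[Assumption]:
        """Assumptions that still need to be checked."""
        return [a for a in self.assumptions if not a.resolved]

    def to_json(self) -> str:
        # ensure_ascii=False keeps Slovak diacritics readable in the log.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


log = ExperimentLog(
    paper="arxiv-2603.15523",
    artifact="https://huggingface.co/datasets/NaiveNeuron/SlovKE",
)
# Example entries are placeholders for details this brief does not state.
log.add("partial matching rule", "token overlap")
cutoff = log.add("evaluation cutoff k", "6", source="paper abstract (F1@6)")
cutoff.resolved = True
serialized = log.to_json()
```

Writing `serialized` to a file alongside each run keeps unresolved guesses visible, and `log.open_items()` gives a quick checklist of what still has to be confirmed.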
