Matched via arXiv identifier search
- Stars: 3
- Last push: May 30, 2026 (1d ago)
Risk flags
- No tagged releases
- No Docker setup
- Low confidence match
Ahrii Kim, Seong-heum Kim
Core AI workload signals detected from paper context and implementation/artifact evidence.
Automatic post-editing (APE) aims to refine machine translations by correcting residual errors. Although recent large language models (LLMs) demonstrate strong translation capabilities, their effectiveness for APE--especially under document-level context--remains insufficiently understood. We present a systematic comparison of proprietary and open-weight LLMs under a naive document-level prompting setup, analyzing APE quality, contextual behavior, robustness, and efficiency. Our results show that proprietary LLMs achieve near human-level APE quality even with simple one-shot prompting, regardless of whether document context is provided. While these models exhibit higher robustness to data poisoning attacks than open-weight counterparts, this robustness also reveals a limitation: they largely fail to exploit document-level context for contextual error correction. Furthermore, standard automatic metrics do not reliably reflect these qualitative improvements, highlighting the continued necessity of human evaluation. Despite their strong performance, the substantial cost and latency overheads of proprietary LLMs render them impractical for real-world APE deployment. Overall, our findings elucidate both the promise and current limitations of LLM-based document-aware APE, and point toward the need for more efficient long-context modeling approaches for translation refinement.
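The abstract mentions a naive document-level prompting setup with one-shot prompting, evaluated with and without document context. As a rough illustration only, the sketch below shows one way such a prompt could be assembled. The `Segment` type, the `build_ape_prompt` function, the prompt wording, and the sample sentences are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    source: str  # source-language sentence
    mt: str      # machine translation to be post-edited


def build_ape_prompt(
    doc: List[Segment],
    index: int,
    example: Tuple[str, str, str],
    use_context: bool = True,
) -> str:
    """Assemble a one-shot APE prompt for doc[index].

    `example` is a (source, mt, post_edit) triple used as the single
    demonstration. The prompt wording is an assumption, not the paper's.
    """
    ex_src, ex_mt, ex_pe = example
    target = doc[index]
    parts = [
        "Post-edit the machine translation so it is accurate and fluent.",
        "Change only what is wrong.",
        "",
        "Example",
        f"Source: {ex_src}",
        f"MT: {ex_mt}",
        f"Post-edit: {ex_pe}",
        "",
    ]
    if use_context:
        # Naive document context: list every other segment of the document.
        parts.append("Document context (source | MT):")
        for i, seg in enumerate(doc):
            if i != index:
                parts.append(f"- {seg.source} | {seg.mt}")
        parts.append("")
    parts += [
        "Segment to post-edit",
        f"Source: {target.source}",
        f"MT: {target.mt}",
        "Post-edit:",
    ]
    return "\n".join(parts)


if __name__ == "__main__":
    doc = [
        Segment("Er ging zur Bank.", "He went to the bench."),
        Segment("Dort hob er Geld ab.", "There he withdrew money."),
    ]
    demo = ("Guten Morgen.", "Good morning", "Good morning.")
    print(build_ape_prompt(doc, 0, demo, use_context=True))
```

Toggling `use_context` gives the two conditions the abstract compares. The resulting string would be sent to whichever LLM is under test, a step deliberately left out here.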
No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.
Spacial/csstuff is the closest maintained adjacent implementation (Matches contextual method/domain keyword: computer science). It is not paper-verified; validate algorithm and evaluation setup against the paper before trusting reported metrics. Community adoption signal: 85 GitHub stars.
Hardware Notes
Expect multi-day setup/compute for meaningful reproduction based on current guidance.
Evidence graph: 3 refs, 3 links.
Utility signals: depth 65/100, grounding 75/100, status medium.
Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.
Matched via arXiv identifier search · Strong overlap with paper title keywords
There is no verified maintained implementation yet. Use this baseline plan to decide whether to prototype now or defer.
Hardware requirements
No verified implementation available
No benchmark numbers could be verified. You will not be able to validate reproduction correctness against published numbers.
Framework baselines
Modern transformer training baseline.
Reference transformer building block implementation.
These are not paper-verified. Use them as reference points when no direct implementation is available.
Matches contextual method/domain keyword: computer science
No additional verified repositories beyond the primary recommendation.
These repositories had low-confidence matching signals and are hidden by default.
No trustworthy direct or curated related Hugging Face artifacts were found yet.
Continue with targeted Hugging Face searches derived from the paper title and method context:
Models
Tip: start with models, then check datasets/spaces if you need evaluation data or demos.
Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.
Tasks
Context (archaeology), Computer science, Risk analysis (engineering), Political science, Linguistics, Business, Machine translation, Economics
Methods
Transformer
Domains
Artificial intelligence, Natural language processing, Work (physics), Action (physics)
Evaluation & Human Feedback Data
Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.