- Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka · Apr 2, 2026 · Citations: 0
Pairwise Preference LLM As Judge Automatic Metrics
A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
- A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models
Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan · Apr 7, 2026 · Citations: 0
Expert Verification Automatic Metrics
Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
- Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning
He Huang · Mar 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We evaluate alignment quality using pairwise metrics, specifically ROC-AUC and triplet accuracy, on curated Egyptian-English and intra-Egyptian cognate datasets.
- Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs
Vedant Pandya · Mar 19, 2026 · Citations: 0
Automatic Metrics Long Horizon
We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT with citation…
- To Write or to Automate Linguistic Prompts, That Is the Question
Marina Sánchez-Torrón, Daria Akselrod, Jason Rauchwerk · Mar 26, 2026 · Citations: 0
Expert Verification
We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment, evaluating five model…
- Voxtral TTS
Mistral-AI, Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg · Mar 26, 2026 · Citations: 0
Human Eval Automatic Metrics
In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5.
- Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties
Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi · Mar 26, 2026 · Citations: 0
Human Eval Automatic Metrics
A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.
- Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not
Sercan Karakaş · Apr 6, 2026 · Citations: 0
Pairwise Preference
Large language models achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution.
- Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation
Ying Li, Xinglin Lyu, Junhui Li, Jinlong Yang, Hengchao Shang · Mar 26, 2026 · Citations: 0
Pairwise Preference
In this paper, we propose Cross-Preference Learning (CPL), a preference-based training framework that explicitly captures the complementary benefits of sentence-level and context-aware MT.
- Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset
Ryoma Suzuki, Zhiyang Qi, Michimasa Inaba · Mar 24, 2026 · Citations: 0
Pairwise Preference
The quality of "Multilingual KokoroChat" was rigorously validated through human preference studies.