- Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka · Apr 2, 2026 · Citations: 0
Pairwise Preference Llm As JudgeAutomatic Metrics
A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
- A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models
Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan · Apr 7, 2026 · Citations: 0
Expert Verification Automatic Metrics
Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
- LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki · Mar 6, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics Long Horizon
To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic,…
- Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning
He Huang · Mar 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We evaluate alignment quality using pairwise metrics, specifically ROC-AUC and triplet accuracy, on curated Egyptian-English and intra-Egyptian cognate datasets.
- A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic
Peter Brodeur, Jacob M. Koshy, Anil Palepu, Khaled Saab, Ava Homiar · Mar 9, 2026 · Citations: 0
Expert Verification Automatic Metrics
Translating these systems into clinical practice requires assessment in real-world workflows with rigorous safety oversight.
- MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0
Expert Verification Automatic Metrics
Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
- Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan · Feb 26, 2026 · Citations: 0
Red Team Automatic Metrics
Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs.
- Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality ''transfer'' across languages.
- Rethinking Metrics for Lexical Semantic Change Detection
Roksana Goworek, Haim Dubossarsky · Feb 17, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and
- The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective
Ali Zahedzadeh, Behnam Bahrak · Feb 15, 2026 · Citations: 0
Automatic Metrics Long Horizon
Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct answers.To operationalize this view, we introduce an evaluation…
- Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen
James L. Zainaldin, Cameron Pattison, Manuela Marai, Jacob Wu, Mark J. Schiefsky · Feb 27, 2026 · Citations: 0
Human EvalAutomatic Metrics
This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose.
- Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs
Vedant Pandya · Mar 19, 2026 · Citations: 0
Automatic Metrics Long Horizon
We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT with citation…
- Video-Based Reward Modeling for Computer-Use Agents
Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Gupta Rahul · Mar 10, 2026 · Citations: 0
Automatic Metrics Long Horizon
Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction.
- Voxtral TTS
Mistral-AI, :, Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg · Mar 26, 2026 · Citations: 0
Human EvalAutomatic Metrics
In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4\% win rate over ElevenLabs Flash v2.5.
- Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties
Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi · Mar 26, 2026 · Citations: 0
Human EvalAutomatic Metrics
A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.
- BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages
Jason Lucas, Matt Murtagh-White, Adaku Uchendu, Ali Al-Lawati, Michiharu Yamashita · Feb 28, 2026 · Citations: 0
Automatic Metrics Multi Agent
We introduce BLUFF, a comprehensive benchmark for detecting false and synthetic content, spanning 79 languages with over 202K samples, combining human-written fact-checked content (122K+ samples across 57 languages) and LLM-generated…
- SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation
Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang · Feb 23, 2026 · Citations: 0
Automatic Metrics Multi Agent
To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task.
- EnsembleLink: Accurate Record Linkage Without Training Data
Noah Dasanaike · Jan 29, 2026 · Citations: 0
Automatic Metrics Tool Use
On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling.