- MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026
Automatic Metrics Medicine Coding
Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
- Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text
Bitan Majumder, Anirban Sen · Feb 25, 2026
Automatic Metrics Simulation Env Coding Multilingual
Sarcasm detection in multilingual and code-mixed environments remains a challenging task for natural language processing models due to structural variations, informal expressions, and limited linguistic resources.
- Multilingual Large Language Models do not comprehend all natural languages to equal degrees
Natalia Moskvina, Raquel Montero, Masaya Yoshida, Ferdy Hubers, Paolo Morosi · Feb 23, 2026
Automatic Metrics Multilingual
Large Language Models (LLMs) play a critical role in how humans access information.
- Cross-lingual Matryoshka Representation Learning across Speech and Text
Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina · Feb 23, 2026
Automatic Metrics Multilingual
We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best.
- SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation
Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang · Feb 23, 2026
Automatic Metrics Multilingual
This limitation stems from the inability of current single-model and static multi-agent systems to perceive and adapt to stylistic variations.
- Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Abhinaba Basu · Feb 21, 2026
Automatic Metrics Multilingual
Personal AI agents incur substantial cost via repeated LLM calls.
- CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts
Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello · Feb 19, 2026
Automatic Metrics Multilingual
HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts.
- What Language is This? Ask Your Tokenizer
Clara Meister, Ahmetcan Yavuz, Pietro Lesci, Tiago Pimentel · Feb 19, 2026
Automatic Metrics Multilingual
Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models.
- ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning
Hussein S. Al-Olimat, Ahmad Alshareef · Feb 19, 2026
Automatic Metrics Multilingual
While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data that would benefit from deeper linguistic verification.
- BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization
Ahmed Rafid, Rumman Adib, Fariya Ahmed, Ajwad Abrar, Mohammed Saidul Islam · Feb 18, 2026
Automatic Metrics Medicine Multilingual
However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries.
- IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models
Saurabh Bharti, Gaurav Azad, Abhinaw Jagtap, Nachiket Tapas · Feb 18, 2026
Automatic Metrics Multilingual
The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual complexity.
- Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel · Feb 16, 2026
Automatic Metrics Coding Multilingual
Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks.
- Text Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation
Ruoxi Liu, Philipp Koehn · Feb 16, 2026
Automatic Metrics Multilingual
This paper proposes a novel method for Text Style Transfer (TST) based on parameter-efficient fine-tuning of Large Language Models (LLMs).
- Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque
Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri · Feb 16, 2026
Automatic Metrics Simulation Env Multilingual
Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces.
- CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025
Automatic Metrics Coding Multilingual
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
- Cross-Lingual Interleaving for Speech Language Models
Adel Moumen, Guangzhi Sun, Philip C. Woodland · Dec 1, 2025
Automatic Metrics Multilingual
However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult.
- Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models
Cheonbok Park, Jeonghoon Kim, Joosung Lee, Sanghwan Bae, Jaegul Choo · Jun 6, 2025
Automatic Metrics Math Multilingual
Reinforcement learning with verifiable reward (RLVR) has been instrumental in eliciting strong reasoning capabilities from large language models (LLMs) via long chains of thought (CoT).
- EuroGEST: Investigating gender stereotypes in multilingual language models
Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch · Jun 4, 2025
Human Eval Automatic Metrics Coding Multilingual
Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric.
- What if Deception Cannot be Detected? A Cross-Linguistic Study on the Limits of Deception Detection from Text
Aswathy Velutharambath, Kai Sassenberg, Roman Klinger · May 19, 2025
Automatic Metrics Multilingual
We further benchmark against other English deception datasets following similar data collection protocols.
- How much does context affect the accuracy of AI health advice?
Prashant Garg, Thiemo Fetzer · Apr 25, 2025
Automatic Metrics Law Multilingual
English-language performance does not reliably generalise across contexts, underscoring the need for multilingual, domain-specific evaluation before deployment in public-health communication.
- ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees
David Smith Sundarsingh, Jun Wang, Jyotirmoy V. Deshmukh, Yiannis Kantaros · Apr 22, 2025
Automatic Metrics Coding Multilingual
Linear Temporal Logic (LTL) is a widely used task specification language for autonomous systems.