Research Utility Snapshot
Evaluation Modes
- Automatic Metrics (15)
- Simulation Env (3)
- Human Eval (1)
Human Feedback Types
- Pairwise Preference (4)
- Expert Verification (3)
- Red Team (3)
Required Expertise
- Multilingual (16)
- Coding (7)
- Law (1)
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng, Kyle Lam · Feb 25, 2026 · Citations: 0
Expert Verification Automatic Metrics Medicine Coding
- Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
- We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case.
Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text Bitan Majumder, Anirban Sen · Feb 25, 2026 · Citations: 0
Automatic Metrics Simulation Env Coding Multilingual
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li, Shujian Huang · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Multilingual
- Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged
SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang · Feb 23, 2026 · Citations: 0
Automatic Metrics Multilingual
- This limitation stems from the inability of current single-model and static multi-agent systems to perceive and adapt to stylistic variations.
- To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task.
IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages Priyaranjan Pattnayak, Sanchari Chowdhuri · Feb 18, 2026 · Citations: 0
Red Team Automatic Metrics Coding Multilingual
- Safety alignment of large language models (LLMs) is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied.
- We introduce Indic Jailbreak Robustness (IJR), a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (2.1 billion speakers), covering 45,216 prompts in JSON (contract-bound) and Free (naturalistic)
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai · Feb 18, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Multilingual
- The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment.
- In this work, we propose a resource-efficient method for improving multilingual safety alignment.
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026 · Citations: 0
Red Team Automatic Metrics Law Multilingual
- LLM-based agents execute real-world workflows via tools and memory.
- These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios.
Rethinking Metrics for Lexical Semantic Change Detection Roksana Goworek, Haim Dubossarsky · Feb 17, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Coding Multilingual
Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri · Feb 16, 2026 · Citations: 0
Automatic Metrics Simulation Env Multilingual
- Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces.
Unlocking Reasoning Capability on Machine Translation in Large Language Models Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio, Tom Kocmi · Feb 16, 2026 · Citations: 0
Critique Edit Automatic Metrics Math Multilingual
- We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.
A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit, Thomas Pickard · Jan 13, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Multilingual
- The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects.
- The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri, Trizal Garg · Dec 26, 2025 · Citations: 0
Expert Verification Automatic Metrics Coding Multilingual
- To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
- Recognizing linguistic diversity, we construct the benchmark in both English and Hindi, establishing a framework that is open for further extension to other regional languages.
World Simulation with Video Foundation Models for Physical AI NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji · Oct 28, 2025 · Citations: 0
Simulation Env Coding Multilingual
- These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
- To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nv
HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang · Jun 4, 2025 · Citations: 0
Expert Verification Automatic Metrics Multilingual
- However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences
- Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations.
EuroGEST: Investigating gender stereotypes in multilingual language models Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch · Jun 4, 2025 · Citations: 0
Human Eval Automatic Metrics Coding Multilingual
- Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric.
- EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics.
Refusal Direction is Universal Across Safety-Aligned Languages Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank · May 22, 2025 · Citations: 0
Red Team Automatic Metrics Multilingual
- Refusal mechanisms in large language models (LLMs) are essential for ensuring safety.
- In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages.