- AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu · Feb 15, 2026 · Citations: 0
Expert Verification Simulation Env Long Horizon
While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
- PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology
Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva · Mar 2, 2026 · Citations: 0
Rubric RatingExpert Verification Llm As JudgeAutomatic Metrics
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety.
- HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026 · Citations: 0
Expert VerificationCritique Edit Automatic Metrics
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
- CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0
Expert Verification Automatic Metrics
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
- PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai · Mar 28, 2026 · Citations: 0
Expert Verification Llm As JudgeAutomatic Metrics
In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
- SODIUM: From Open Web Data to Queryable Databases
Chuxuan Hu, Philip Li, Maxwell Yang, Daniel Kang · Mar 19, 2026 · Citations: 0
Expert Verification Automatic Metrics Multi Agent
Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM-Bench, with the strongest baseline achieving only 46.5% accuracy.
- Measuring AI Ability to Complete Long Software Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia · Mar 18, 2025 · Citations: 0
Expert Verification Automatic Metrics Tool Use
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
- Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang · Mar 27, 2026 · Citations: 0
Rubric RatingExpert Verification Automatic Metrics
To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains.
- FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data
Mitul Goswami, Romit Chatterjee, Arif Ahmed Sekh · Mar 16, 2026 · Citations: 0
Expert Verification Automatic Metrics
Post-mitigation evaluation on seven clinically distinct cohorts derived from the MIMIC-IV-ED and eICU databases demonstrates substantial bias reduction: Statistical Parity Difference decreases by 40 to 51 percent on MIMIC-IV-ED and 10 to 19…
- Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching
Stephen Gadd · Jan 11, 2026 · Citations: 0
Expert Verification Automatic Metrics
Trained on 32.7 million triplet samples drawn from 67 million toponyms spanning GeoNames, Wikidata, and the Getty Thesaurus of Geographic Names, the Student achieves the highest Recall@1 (85.2%) and Mean Reciprocal Rank (90.8%) on the…