- AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu · Feb 15, 2026 · Citations: 0
Expert Verification · Simulation Env · Long Horizon
While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
- HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026 · Citations: 0
Expert Verification · Critique Edit · Automatic Metrics
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
- SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Sher Badshah, Ali Emami, Hassan Sajjad · Feb 13, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics
Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
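The snippet names the ingredients (selective prediction, conformal calibration, pairwise judging) but not the procedure. A minimal sketch of how those pieces commonly fit together, with every name below hypothetical rather than taken from the paper:

```python
import numpy as np

def calibrate_threshold(cal_scores, alpha=0.1):
    """Split-conformal threshold from held-out pairs with human labels.

    cal_scores: nonconformity scores (e.g. 1 - probability assigned to the
    humans' preferred response) on a labeled calibration set.
    alpha: tolerated error rate among the judgments the judge keeps.
    """
    n = len(cal_scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    return np.quantile(cal_scores, level, method="higher")

def selective_judge(p_a_wins, threshold):
    """Emit a verdict only when the judge's confidence clears the threshold."""
    nonconformity = 1.0 - max(p_a_wins, 1.0 - p_a_wins)
    if nonconformity > threshold:
        return "abstain"  # defer this pair to a human annotator
    return "A" if p_a_wins >= 0.5 else "B"
```

Abstaining on low-confidence pairs trades coverage for a calibrated bound on the error rate of the judgments that are kept.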
- Document Reconstruction Unlocks Scalable Long-Context RLVR
Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin · Feb 9, 2026 · Citations: 0
Rubric Rating · Automatic Metrics
However, RLVR often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, both of which are costly and time-consuming to obtain.
- MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen · Feb 18, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics · Web Browsing
Existing evaluations of agents with memory typically assess memorization and action in isolation.
- Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics
Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
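Only the high-level recipe is given above. As a rough illustration, not the paper's algorithm, a tournament-style loop in which the generating LLM also supplies the selection signal could look like this, with `llm` a hypothetical text-in/text-out callable:

```python
import random

def duel(llm, task, a, b):
    """Elicit a pairwise self-preference from the same model that generated a and b."""
    verdict = llm(f"Task: {task}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
                  "Which response is better? Answer 'A' or 'B'.")
    return a if verdict.strip().upper().startswith("A") else b

def duel_evolve_step(llm, task, population):
    """One generation: pairwise duels select parents, the LLM mutates them."""
    random.shuffle(population)
    pairs = zip(population[::2], population[1::2])  # assumes an even-sized population
    winners = [duel(llm, task, a, b) for a, b in pairs]
    # Refill the population by asking the same LLM to vary each winner.
    children = [llm(f"Task: {task}\n\nRevise and improve this response:\n{w}")
                for w in winners]
    return winners + children
```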
- Can Large Language Models Replace Human Coders? Introducing ContentBench
Michael Haman · Feb 23, 2026 · Citations: 0
Critique Edit · Automatic Metrics
This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how closely low-cost LLMs agree with human coders, and at what cost, on the same interpretive coding tasks.
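The snippet does not say which agreement statistic ContentBench reports; for interpretive coding, a chance-corrected measure such as Cohen's kappa is the usual choice, and computing it between an LLM's labels and a human's is straightforward:

```python
def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders on the same items,
    e.g. an LLM's labels versus a human's."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    categories = set(coder_a) | set(coder_b)
    expected = sum((coder_a.count(c) / n) * (coder_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:  # both coders used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Example with made-up labels on five items:
print(cohens_kappa(["pos", "neg", "pos", "neu", "pos"],
                   ["pos", "neg", "neg", "neu", "pos"]))
```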
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0
Automatic Metrics · Long Horizon
Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
- D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0
Automatic Metrics · Long Horizon
In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration --…
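The abstract names the tag vocabulary but not the full template or training recipe. One guess at how such control tags could scaffold a prompt; only <TEMP_LOW> and <TEMP_HIGH> are attested in the snippet above, and their placement here is assumed:

```python
# Hypothetical D-CoT-style prompt scaffold, not the paper's actual template.
D_COT_TEMPLATE = """Question: {question}

<TEMP_HIGH>
Brainstorm several candidate approaches from different perspectives.
</TEMP_HIGH>

<TEMP_LOW>
Fact-check the most promising approach step by step.
</TEMP_LOW>

Final answer:"""

def build_prompt(question: str) -> str:
    return D_COT_TEMPLATE.format(question=question)
```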
- Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu · Feb 26, 2026 · Citations: 0
Automatic Metrics · Long Horizon
Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.
- Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026 · Citations: 0
Automatic Metrics · Tool Use
Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%.
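The paper's mechanism is not detailed in the snippet; the title suggests a confidence-gated cascade over model scales, which in its simplest form (all names illustrative) is:

```python
def cascade_answer(models, prompt, thresholds):
    """Route a query through models ordered cheap -> expensive.

    models: callables returning (answer, confidence in [0, 1]).
    thresholds: per-model confidence cutoffs for the non-final models.
    """
    for answer_fn, tau in zip(models[:-1], thresholds):
        answer, confidence = answer_fn(prompt)
        if confidence >= tau:
            return answer           # a cheaper model was confident enough
    answer, _ = models[-1](prompt)  # the largest model is the fallback
    return answer
```

Savings come from how often cheap models clear their thresholds, which is consistent with the reported 20% to 40% cost reduction at near-flagship accuracy.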
- BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo · Feb 19, 2026 · Citations: 0
Automatic Metrics · Long Horizon
However, such numerical-reasoning errors have rarely been captured by existing benchmarks.