A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026
Citations: 0
Automatic Metrics · Math
Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
GR achieves a higher GPT-judged win rate in RLHF, avoids over-focusing on format in rule-based math rewards, and prevents judge hacking in LLM-as-a-Judge math tasks.
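For readers triaging judge-based results like these, here is a minimal sketch of a pairwise GPT-judged win-rate computation. The `judge` callable is a hypothetical stand-in for any LLM judge, not the paper's setup; judging both orderings is a common mitigation against position bias and trivial judge hacking.

```python
from typing import Callable, List, Tuple

def win_rate(judge: Callable[[str, str, str], str],
             pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of (prompt, response_a, response_b) comparisons won by A."""
    score = 0.0
    for prompt, a, b in pairs:
        # Judge both orderings to reduce position bias.
        first = judge(prompt, a, b)    # verdict "A" means response a wins
        second = judge(prompt, b, a)   # verdict "B" means response a wins
        for verdict, a_wins in ((first, "A"), (second, "B")):
            if verdict == a_wins:
                score += 0.5
            elif verdict == "tie":
                score += 0.25
    return score / len(pairs)
```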
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026
Citations: 0
Automatic Metrics · Multi Agent · General
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning with each other in the form of Chain-of-Thought (CoT) traces.
Current CoT evaluation narrowly focuses on target task accuracy.
Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026
Citations: 0
Automatic Metrics · Tool Use · General
Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%.
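The snippet does not spell out the method, but a hedged sketch of one plausible confidence-threshold cascade (with hypothetical `small` and `large` model callables) illustrates where savings of this kind typically come from:

```python
from typing import Callable, Tuple

def cascade(question: str,
            small: Callable[[str], Tuple[str, float]],
            large: Callable[[str], Tuple[str, float]],
            threshold: float = 0.8) -> str:
    # Try the cheap model first; its confidence decides whether we pay
    # for the large model. The 0.8 threshold is an illustrative choice.
    answer, confidence = small(question)
    if confidence >= threshold:
        return answer
    answer, _ = large(question)  # escalate only the uncertain questions
    return answer
```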
Additional evaluation on an earlier exam sample revealed that the writing has become more complex over a 7-10 year period, while accuracy still reached 0.8 with some feature sets.
The results have been implemented in the writing evaluation module of an Estonian open-source language learning environment.
Alex Stein, Furong Huang, Tom Goldstein · Feb 24, 2026
Citations: 0
Automatic Metrics · Long Horizon · Math
Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.
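For reference, maj@8 denotes majority-vote accuracy over eight samples per problem. A minimal sketch, assuming a hypothetical `sample` callable and string-normalized answers:

```python
from collections import Counter
from typing import Callable, List

def maj_at_k(sample: Callable[[str], str], problems: List[str],
             gold: List[str], k: int = 8) -> float:
    # For each problem, draw k answers and score the most common one.
    correct = 0
    for problem, answer in zip(problems, gold):
        votes = Counter(sample(problem) for _ in range(k))
        majority, _ = votes.most_common(1)[0]
        correct += int(majority == answer)
    return correct / len(problems)
```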
Reward models play a fundamental role in aligning large language models with human preferences.
Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead.
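The scalar discriminative paradigm mentioned here is typically trained with a Bradley-Terry preference loss. A minimal PyTorch sketch of that standard formulation (the scores would come from some scalar reward model, which is not specified in the snippet):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # chosen/rejected are scalar reward scores for preferred and
    # dispreferred responses; loss = -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(chosen - rejected).mean()

# Example with dummy scores in place of real reward-model outputs:
loss = bradley_terry_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
```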
Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab · Oct 27, 2025
Citations: 0
Human Eval · Automatic Metrics · Coding
We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuances.
Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English.
Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
We introduce ResGo, a benchmark of laparoscopic frames annotated with Go Zone bounding boxes and clinician-authored rationales covering phase, exposure-quality reasoning, next action, and risk reminders.
Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 perplexity drops from 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.
Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li, Gang Yang · Feb 23, 2026
Citations: 0
Automatic Metrics · Long Horizon · Medicine · Coding
The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025
Citations: 0
Automatic Metrics · Long Horizon · General
A Head Agent provides guidance that steers retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); finally, the Head Agent composes candidate answers.
Experiments on HotpotQA (text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1 gains over strong single-pass, multi-hop, and agentic RAG baselines with high efficiency.
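A loose sketch of the iterate-then-compose loop described above; `head_guidance`, `pick_action`, and `expand` are hypothetical stand-ins for the Head and Iteration Agent calls, not the paper's interfaces:

```python
def agentic_retrieve(question, seed_evidence, head_guidance, pick_action,
                     expand, max_steps=5):
    # Iteratively grow the evidence set under Head Agent guidance,
    # following structure-respecting actions until none remain useful.
    evidence = list(seed_evidence)
    for _ in range(max_steps):
        guidance = head_guidance(question, evidence)  # Head Agent steers
        action = pick_action(guidance, evidence)      # Iteration Agent picks a hop
        if action is None:
            break
        evidence.extend(expand(action))               # e.g., parent/child hop, KG relation
    return evidence
```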
Xiang Zheng, Weiqi Zhai, Wei Wang, Boyu Yang, Wenbo Li, Ruixiang Luo · Jan 31, 2026
Citations: 0
Automatic Metrics · Multi Agent · Math
Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about whether these benchmarks can still diagnose genuine reasoning competence.
To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning.
Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments.
However, existing benchmarks for multimodal browsing remain limited in task complexity, evidence accessibility, and evaluation granularity, hindering comprehensive and reproducible assessments of deep search capabilities.
Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park, Jihee Kim · Oct 8, 2025
Citations: 0
Automatic Metrics · Simulation Env · General
To address this, we introduce EconCausal, a large-scale benchmark comprising 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies published in top-tier economics and finance journals.
Our evaluation reveals critical limitations in current LLMs' context-dependent reasoning.