HFEPX Hub

Automatic Metrics + General (Last 30 Days)

Updated from current HFEPX corpus (Apr 19, 2026). 64 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Apr 19, 2026). 64 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Inter Annotator Agreement Reported. Frequently cited benchmark: HotpotQA. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Apr 8, 2026.

Papers: 64 Last published: Apr 8, 2026 Global RSS Tag RSS

Automatic MetricsGeneralLast 30d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: High .

Analysis blocks below are computed from the currently loaded sample (60 of 64 total papers in this hub).

All Sampled Papers (60) Replication-Ready Only (20)

High-Signal Coverage

100.0%

60 / 60 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

20 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
7 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

43.8% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 100% of papers in this hub.
HotpotQA is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Most common quality-control signal is inter-annotator agreement reporting (7.8% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Benchmark Interpretation

HotpotQA appears in 3.1% of hub papers (2/64); use this cohort for benchmark-matched comparisons.
AlpacaEval appears in 1.6% of hub papers (1/64); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 54.7% of hub papers (35/64); compare with a secondary metric before ranking methods.
cost is reported in 20.3% of hub papers (13/64); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Moderate: Papers with explicit human feedback

Coverage is usable but incomplete (43.8% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (10.9% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (31.3% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (100% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (10.9% vs 35% target).
Strong: Papers with known annotation unit

Coverage is strong (46.9% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.
Agentic evaluation appears in 48.4% of papers.

Known Gaps

Only 10.9% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10.9% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (HotpotQA vs AlpacaEval) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries (Expanded)

Recommended Queries

Judge vs Human Agreement Benchmark Slice: HotpotQA Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

Personalized RewardBench: Evaluating Reward Models with Human Aligned…

Highest protocol score with explicit human/eval signal plus Rewardbench.

Strongest benchmark reference

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step To…

Tracesafe-Bench with accuracy gives a fast comparison anchor.

Strongest recent paper

Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Pa…

Useful for current practice scanning; published Apr 7, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Apr 8, 2026 · Citations: 0 · Score: 8.0

HF: Pairwise Preference, Rubric Rating · Eval: Human Eval, Automatic Metrics · Benchmark: Rewardbench · Metric: Accuracy
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Apr 8, 2026 · Citations: 0 · Score: 8.0

HF: Red Team · Eval: Automatic Metrics · Benchmark: Tracesafe Bench · Metric: Accuracy
Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching
Apr 7, 2026 · Citations: 0 · Score: 8.0

HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: Scirepeval · Metric: Recall
When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools
Mar 25, 2026 · Citations: 0 · Score: 8.0

HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: Interaction2eval · Metric: Agreement
Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
Mar 30, 2026 · Citations: 0 · Score: 8.0

HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: TruthfulQA · Metric: Accuracy
Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
Mar 27, 2026 · Citations: 0 · Score: 8.0

HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: Olympiadbench · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization Apr 8, 2026	Yes Pairwise Preference , Rubric Rating	Human Eval , Automatic Metrics	Rewardbench	Accuracy , Helpfulness	Not Reported
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories Apr 8, 2026	Yes Red Team	Automatic Metrics	Tracesafe Bench	Accuracy	Not Reported
Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching Apr 7, 2026	Yes Rubric Rating	Automatic Metrics	Scirepeval	Recall	Not Reported
When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools Mar 25, 2026	Yes Rubric Rating	Automatic Metrics	Interaction2eval	Agreement	Not Reported
Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation Mar 30, 2026	Yes Rubric Rating	Automatic Metrics	TruthfulQA	Accuracy	Not Reported
Stabilizing Rubric Integration Training via Decoupled Advantage Normalization Mar 27, 2026	Yes Rubric Rating	Automatic Metrics	Olympiadbench	Accuracy	Not Reported
ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims Mar 27, 2026	Yes Pairwise Preference	Automatic Metrics	Codabench	Recall , Recall@k	Not Reported
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment Mar 23, 2026	Yes Pairwise Preference	Automatic Metrics	MT Bench , AlpacaEval	Accuracy	Not Reported
More Human, More Efficient: Aligning Annotations with Quantized SLMs Apr 1, 2026	Yes Rubric Rating	Automatic Metrics	Not Reported	Agreement	Inter Annotator Agreement Reported , Adjudication
From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories Mar 30, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Kappa , Agreement	Inter Annotator Agreement Reported
Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation Mar 20, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Kappa , Faithfulness	Inter Annotator Agreement Reported
ReasonScaffold: A Scaffolded Reasoning-based Annotation Protocol for Human-AI Co-Annotation Mar 22, 2026	Yes Critique Edit	Automatic Metrics	Not Reported	Accuracy , Agreement	Inter Annotator Agreement Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	Personalized RewardBench: Evaluating Reward Models…	TraceSafe: A Systematic Assessment of LLM Guardrail…	Beyond Paper-to-Paper: Structured Profiling and Rub…
Human Feedback	Pairwise Preference, Rubric Rating	Red Team	Rubric Rating
Evaluation Modes	Human Eval, Automatic Metrics	Automatic Metrics	Automatic Metrics
Benchmarks	Rewardbench	Tracesafe Bench	Scirepeval
Metrics	Accuracy, Helpfulness	Accuracy	Recall
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Domain Experts
Annotation Unit	Pairwise	Trajectory	Multi Dim Rubric

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (18)
Rubric Rating (6)
Critique Edit (3)
Demonstrations (1)

Evaluation Modes

Automatic Metrics (64)
Simulation Env (7)
Llm As Judge (5)
Human Eval (3)

Top Benchmarks

HotpotQA (2)
AlpacaEval (1)
BFCL (1)
Codabench (1)

Top Metrics

Accuracy (35)
Cost (13)
Latency (8)
Agreement (6)

Rater Population Mix

Domain Experts (7)

Quality Controls

Inter Annotator Agreement Reported (5)
Calibration (2)
Adjudication (1)

Coverage diagnostics (sample-based): human-feedback 46.7% · benchmarks 33.3% · metrics 100.0% · quality controls 11.7%.

Top Papers

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Human EvalAutomatic Metrics

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0

Red Team Automatic Metrics Long Horizon

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
More Human, More Efficient: Aligning Annotations with Quantized SLMs
Jiayu Wang, Junyoung Lee · Apr 1, 2026 · Citations: 0

Rubric Rating Automatic Metrics

As Large Language Model (LLM) capabilities advance, the demand for high-quality annotation of exponentially increasing text corpora has outpaced human capacity, leading to the widespread adoption of LLMs in automatic evaluation and…
Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching
Yicheng Pan, Zhiyuan Ning, Ludi Wang, Yi Du · Apr 7, 2026 · Citations: 0

Rubric Rating Automatic Metrics

To address this gap, we propose P2R, a training-free framework that shifts from implicit paper-to-paper matching to explicit profile-based matching.
When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools
Xingming Li, Runke Huang, Yanan Bao, Yuye Jin, Yuru Jiao · Mar 25, 2026 · Citations: 0

Rubric Rating Automatic Metrics

In this paper, we investigate whether AI can serve as a scalable assessment teammate by extracting structured quality indicators and validating their alignment with human expert judgments.
From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories
Daban Q. Jaff · Mar 30, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

After assembling model outputs, we introduce an agreement-based stability taxonomy (ABC) to stratify inter-model output stability.
Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Richard J. Young · Mar 20, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Three classifiers (a regex-only detector, a regex-plus-LLM pipeline, and a Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters.
HyperMem: Hypergraph Memory for Long-Term Conversations
Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang · Apr 9, 2026 · Citations: 0

Pairwise Preference Llm As JudgeAutomatic Metrics

Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.
Signals: Trajectory Sampling and Triage for Agentic Interactions
Shuguang Chen, Adil Hafeez, Salman Paracha · Apr 1, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

We propose a lightweight, signal-based framework for triaging agentic interaction trajectories.
Learning When to Act: Interval-Aware Reinforcement Learning with Predictive Temporal Structure
Davide Di Gioia · Mar 23, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

Autonomous agents operating in continuous environments must decide not only what to do, but when to act.
Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
Xinran Zhang · Mar 30, 2026 · Citations: 0

Rubric Rating Automatic Metrics

Atomic decomposition -- breaking a candidate answer into claims before verifying each against a reference -- is a widely adopted design for LLM-based reference-grounded judges.
Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
Zelin Tan, Zhouliang Yu, Bohan Lin, Zijie Geng, Hejia Geng · Mar 27, 2026 · Citations: 0

Rubric Rating Automatic Metrics

We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward…
ReasonScaffold: A Scaffolded Reasoning-based Annotation Protocol for Human-AI Co-Annotation
Smitha Muthya Sudheendra, Jaideep Srivastava · Mar 22, 2026 · Citations: 0

Critique Edit Automatic Metrics

We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior.
Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning
Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao · Apr 7, 2026 · Citations: 0

Expert Verification Automatic Metrics

These models remarkably achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly…
Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE
Hejin Huang, Jusheng Zhang, Kaitong Cai, Jian Wang, Rong Pan · Mar 31, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Preference-based alignment objectives have been widely adopted, from RLHF-style pairwise learning in large language models to emerging applications in recommender systems.
ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims
Raia Abu Ahmad, Max Upravitelev, Aida Usmanova, Veronika Solopova, Georg Rehm · Mar 27, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

In addition to standard evaluation metrics (Recall@K and Binary Preference), we adapt an automated framework to assess retrieval quality under incomplete annotations, exposing systematic biases in how conventional metrics rank systems.
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith · Mar 23, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility.
How Much LLM Does a Self-Revising Agent Actually Need?
Sungwoo Jung, Seonil Son · Apr 8, 2026 · Citations: 0

Critique Edit Automatic Metrics

Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop.
MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control
Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang · Apr 7, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang · Apr 7, 2026 · Citations: 0

Automatic MetricsSimulation Env Multi Agent

Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision.
Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu · Apr 2, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process.
Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences
Simona-Vasilica Oprea, Adela Bâra · Apr 1, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Using the Anthropic HHRLHF dataset, we evaluate ten diverse large language models LLMs under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task.
MemRerank: Preference Memory for Personalized Product Reranking
Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong · Mar 31, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch.
Routing Sensitivity Without Controllability: A Diagnostic Study of Fairness in MoE Language Models
Junhyeok Lee, Kyu Sung Choi · Mar 28, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

FARE reveals that routing-level preference shifts are either unachievable (Mixtral, Qwen1.5, Qwen3), statistically non-robust (DeepSeekMoE), or accompanied by substantial utility cost (OLMoE, -4.4%p CrowS-Pairs at -6.3%p TQA).
LLM-as-a-Judge for Time Series Explanations
Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar · Apr 2, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics

Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional…
MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
Shu Wang, Edwin Yu, Oscar Love, Tom Zhang, Tom Wong · Apr 6, 2026 · Citations: 0

Automatic Metrics Long Horizon

Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long-horizon reasoning, yet standard context-window and retrieval-augmented generation (RAG) pipelines degrade over…
Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
Xuan Qi · Apr 2, 2026 · Citations: 0

Automatic Metrics Tool Use

Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood.
OSCAR: Orchestrated Self-verification and Cross-path Refinement
Yash Shah, Abhijit Chakraborty, Naresh Kumar Devulapally, Vishnu Lokhande, Vivek Gupta · Apr 2, 2026 · Citations: 0

Automatic Metrics Long Horizon

We introduce a suite of trajectory-level assessments, including a cross-chain divergence-at-hallucination (CDH) metric, for principled comparison of localization methods.
Asymmetric Actor-Critic for Multi-turn LLM Agents
Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia · Mar 31, 2026 · Citations: 0

Automatic Metrics Long Horizon

In many real-world applications, agents must succeed in one-shot settings where retries are impossible.
EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises
Ankush Agarwal, Harsh Vishwakarma, Suraj Nagaje, Chaitanya Devaguptapu · Mar 23, 2026 · Citations: 0

Automatic Metrics Long Horizon

Deploying AI agents in enterprise environments requires balancing capability with data sovereignty and cost constraints.
Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala, Jouni Isoaho · Apr 6, 2026 · Citations: 0

Human EvalAutomatic Metrics

However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation.
SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks
Sunder Ali Khowaja, Kapal Dev, Engin Zeydan, Madhusanka Liyanage · Apr 2, 2026 · Citations: 0

Automatic MetricsSimulation Env

In this regard, we propose the Synthetic Data Generation with Ethics Audit Loop (SEAL) framework, which extends baseline modular pipelines with an Ethical and Regulatory Compliance by Design (ERCD) module and a Federated Learning (FL)…
State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation
Navan Preet Singh, Anurag Garikipati, Ahmed Abulkhair, Jyani Akshay Jagdishbhai, Atul Yaduvanshi · Apr 7, 2026 · Citations: 0

Demonstrations Automatic Metrics

Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by…
ActionParty: Multi-Subject Action Binding in Generative Video Games
Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov · Apr 2, 2026 · Citations: 0

Automatic MetricsSimulation Env Multi Agent

However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene.
PLOT: Enhancing Preference Learning via Optimal Transport
Liang Zhu, Yuelin Bai, Xiankun Ren, Jiaxi Yang, Lei Zhang · Apr 2, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global…
ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models
Delip Rao, Feijiang Han, Chris Callison-Burch · Apr 2, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

By contrast, zero-shot chain-of-thought on the base Gemma3-1B harms accuracy relative to direct answers, and preference optimization with a simple format+accuracy reward underperforms supervised reasoning.
Multi-Agent Dialectical Refinement for Enhanced Argument Classification
Jakub Bąba, Jarosław A. Chudziak · Mar 29, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics Multi Agent

We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty.
MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation
Taolin Han, Shuang Wu, Jinghang Wang, Yuhao Zhou, Renquan Lv · Mar 26, 2026 · Citations: 0

Automatic MetricsSimulation Env Long Horizon

Current scientific evaluation benchmarks predominantly rely on static, single-turn Question Answering (QA) formats, which are inadequate for measuring model performance in complex scientific tasks that require multi-step iteration and…
OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework
Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang · Mar 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement.
BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents
Praveen Kumar Myakala, Manan Agrawal, Rahul Manche · Mar 25, 2026 · Citations: 0

Pairwise PreferenceCritique Edit Automatic Metrics

LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved.
Mind over Space: Can Multimodal Large Language Models Mentally Navigate?
Qihui Zhu, Shouwei Ruan, Xiao Yang, Hao Jiang, Yao Huang · Mar 23, 2026 · Citations: 0

Automatic MetricsSimulation Env Web Browsing

Despite the widespread adoption of MLLMs in embodied agents, their capabilities remain largely confined to reactive planning from immediate observations, consistently failing in spatial reasoning across extensive spatiotemporal scales.
Diff-KD: Diffusion-based Knowledge Distillation for Collaborative Perception under Corruptions
Pengcheng Lyu, Chaokun Zhang, Gong Chen, Tao Tang, Zhaoxiang Luo · Apr 2, 2026 · Citations: 0

Automatic Metrics Multi Agent

Multi-agent collaborative perception enables autonomous systems to overcome individual sensing limits through collective intelligence.
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit · Apr 7, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics

We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence embedding similarity, and an LLM as a judge verdict to label generated responses as grounded or hallucinated without…
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
Zhifei Xie, Zongzheng Hu, Fangda Ye, Xin Zhang, Haobo Chai · Apr 9, 2026 · Citations: 0

Automatic Metrics Long Horizon

Prior work remains largely confined to laboratory settings, leaving a clear gap in real-world proactive agent: depth, complexity, ambiguity, precision and real-time constraints.
Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency
Guan-Ting Lin, Chen Chen, Zhehuai Chen, Hung-yi Lee · Apr 6, 2026 · Citations: 0

Automatic Metrics Tool Use

We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use.
$\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi · Apr 1, 2026 · Citations: 0

Automatic Metrics Long Horizon

As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound.
FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol
Jie Zhu, Yimin Tian, Boyang Li, Kehao Wu, Zhongzhi Liang · Mar 26, 2026 · Citations: 0

Automatic Metrics Tool Use

This paper introduces FinMCP-Bench, a novel benchmark for evaluating large language models (LLMs) in solving real-world financial problems through tool invocation of financial model context protocols.
Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards
Cheng Jiayang, Xin Liu, Zhihan Zhang, Haoyang Wen, Zixuan Zhang · Mar 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

We present a framework addressing both challenges.
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang · Apr 9, 2026 · Citations: 0

Automatic Metrics Tool Use

The advent of agentic multimodal models has empowered systems to actively interact with external environments.
LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications
Mayank Mayank, Bharanidhar Duraisamy, Florian Geiss · Apr 2, 2026 · Citations: 0

Automatic Metrics Long Horizon

Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset…
Cognitive Friction: A Decision-Theoretic Framework for Bounded Deliberation in Tool-Using Agents
Davide Di Gioia · Mar 31, 2026 · Citations: 0

Automatic Metrics Tool Use

Autonomous tool-using agents in networked environments must decide which information source to query and when to stop querying and act.
AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents
Zhaopeng Feng, Liangcai Su, Zhen Zhang, Xinyu Wang, Xiaotian Zhang · Mar 29, 2026 · Citations: 0

Automatic Metrics Long Horizon

As large language models (LLMs) evolve into autonomous agents for long-horizon information-seeking, managing finite context capacity has become a critical bottleneck.
S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation
Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava · Mar 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models.
Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou · Apr 8, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics

We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations.
Learning to Play Blackjack: A Curriculum Learning Perspective
Amirreza Alasti, Efe Erdal, Yücel Celik, Theresa Eimer · Mar 31, 2026 · Citations: 0

Automatic MetricsSimulation Env

We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually.
When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech
Nicolás Benjamín Ocampo, Tommaso Caselli, Davide Ceolin · Mar 26, 2026 · Citations: 0

Human EvalAutomatic Metrics

We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data.
Does Explanation Correctness Matter? Linking Computational XAI Evaluation to Human Understanding
Gregor Baer, Chao Zhang, Isel Grau, Pieter Van Gorp · Mar 26, 2026 · Citations: 0

Automatic MetricsSimulation Env

Higher correctness is assumed to produce better human understanding, but this link has not been tested experimentally with controlled levels.
Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing
Wenhao Yuan, Chenchen Lin, Jian Chen, Jinfeng Xu, Xuehe Wang · Apr 9, 2026 · Citations: 0

Automatic Metrics Long Horizon

In large language model (LLM) agents, reasoning trajectories are treated as reliable internal beliefs for guiding actions and updating memory.
Learning to Interrupt in Language-based Multi-agent Communication
Danqing Wang, Da Yin, Ruta Desai, Lei Li, Asli Celikyilmaz · Apr 7, 2026 · Citations: 0

Automatic Metrics Multi Agent

Motivated by this, we propose an interruptible communication framework that allows the agent who is listening to interrupt the current speaker.
Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency
Payal Fofadiya, Sunil Tiwari · Apr 2, 2026 · Citations: 0

Automatic Metrics Long Horizon

Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation.

Related Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now