HFEPX Hub

CS.LG Papers (Last 90 Days)

Updated from current HFEPX corpus (Apr 12, 2026). 1212 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 1212 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: DROP. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 15, 2026.

Papers: 1,212 Last published: Feb 15, 2026 Global RSS

Cs.LGLast 90d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: High .

Analysis blocks below are computed from the currently loaded sample (60 of 1,212 total papers in this hub).

All Sampled Papers (60) Replication-Ready Only (19)

High-Signal Coverage

100.0%

60 / 60 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

19 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
8 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

7.3% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 21.6% of papers in this hub.
DROP is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Most common quality-control signal is rater calibration (2.1% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Benchmark Interpretation

DROP appears in 0.5% of hub papers (6/1212); use this cohort for benchmark-matched comparisons.
GSM8K appears in 0.5% of hub papers (6/1212); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 10.6% of hub papers (129/1212); compare with a secondary metric before ranking methods.
cost is reported in 5.7% of hub papers (69/1212); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (7.3% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (3% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (7.2% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (23.8% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (5.3% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (8.9% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (5.3% coverage).
Annotation unit is under-specified (8.9% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (DROP vs GSM8K) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries (Expanded)

Recommended Queries

Judge vs Human Agreement Benchmark Slice: DROP Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

Personalized RewardBench: Evaluating Reward Models with Human Aligned…

Highest protocol score with explicit human/eval signal plus Rewardbench.

Strongest benchmark reference

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step To…

Tracesafe-Bench with accuracy gives a fast comparison anchor.

Strongest recent paper

Paper Reconstruction Evaluation: Evaluating Presentation and Hallucin…

Useful for current practice scanning; published Apr 1, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Apr 8, 2026 · Citations: 0 · Score: 8.0

HF: Pairwise Preference, Rubric Rating · Eval: Human Eval, Automatic Metrics · Benchmark: Rewardbench · Metric: Accuracy
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Apr 8, 2026 · Citations: 0 · Score: 8.0

HF: Red Team · Eval: Automatic Metrics · Benchmark: Tracesafe Bench · Metric: Accuracy
Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
Apr 1, 2026 · Citations: 0 · Score: 8.0

HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: Paperwrite Bench · Metric: Cost
FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data
Mar 16, 2026 · Citations: 0 · Score: 8.0

HF: Expert Verification · Eval: Automatic Metrics · Benchmark: DROP · Metric: Accuracy
Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness
Mar 16, 2026 · Citations: 0 · Score: 8.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Esdr Bench · Metric: Accuracy
Do Phone-Use Agents Respect Your Privacy?
Apr 1, 2026 · Citations: 0 · Score: 8.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: APPS · Metric: Task success

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization Apr 8, 2026	Yes Pairwise Preference , Rubric Rating	Human Eval , Automatic Metrics	Rewardbench	Accuracy , Helpfulness	Not Reported
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories Apr 8, 2026	Yes Red Team	Automatic Metrics	Tracesafe Bench	Accuracy	Not Reported
Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers Apr 1, 2026	Yes Rubric Rating	Automatic Metrics	Paperwrite Bench	Cost	Not Reported
FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data Mar 16, 2026	Yes Expert Verification	Automatic Metrics	DROP	Accuracy , Auroc	Not Reported
Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness Mar 16, 2026	Yes Pairwise Preference	Automatic Metrics	Esdr Bench	Accuracy	Not Reported
Do Phone-Use Agents Respect Your Privacy? Apr 1, 2026	Yes Pairwise Preference	Automatic Metrics	APPS , Myphonebench	Task success	Not Reported
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment Mar 23, 2026	Yes Pairwise Preference	Automatic Metrics	MT Bench , AlpacaEval	Accuracy	Not Reported
CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks Mar 19, 2026	Yes Pairwise Preference	Automatic Metrics	Harmbench	Cost	Not Reported
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents Feb 15, 2026	Yes Expert Verification	Simulation Env	Ad Bench	Pass@1 , Pass@3	Not Reported
\$OneMillion-Bench: How Far are Language Agents from Human Experts? Mar 9, 2026	Yes Rubric Rating	Automatic Metrics	Onemillion Bench	Accuracy , Coherence	Not Reported
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking Feb 27, 2026	Yes Red Team	Llm As Judge	AdvBench , Jbf Eval	Success rate , Jailbreak success rate	Not Reported
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models Feb 2, 2026	Yes Expert Verification	Automatic Metrics	Vdr Bench	Not Reported	Adjudication

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	Personalized RewardBench: Evaluating Reward Models…	TraceSafe: A Systematic Assessment of LLM Guardrail…	Paper Reconstruction Evaluation: Evaluating Present…
Human Feedback	Pairwise Preference, Rubric Rating	Red Team	Rubric Rating
Evaluation Modes	Human Eval, Automatic Metrics	Automatic Metrics	Automatic Metrics
Benchmarks	Rewardbench	Tracesafe Bench	Paperwrite Bench
Metrics	Accuracy, Helpfulness	Accuracy	Cost
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Pairwise	Trajectory	Multi Dim Rubric

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (43)
Expert Verification (16)
Red Team (13)
Demonstrations (11)

Evaluation Modes

Automatic Metrics (262)
Simulation Env (28)
Llm As Judge (15)
Human Eval (5)

Top Benchmarks

DROP (6)
GSM8K (6)
MMLU (6)
AIME (4)

Top Metrics

Accuracy (129)
Cost (69)
Precision (23)
Latency (21)

Rater Population Mix

Domain Experts (62)
Crowd (1)
Mixed (1)

Quality Controls

Calibration (25)
Adjudication (5)
Inter Annotator Agreement Reported (4)
Gold Questions (2)

Coverage diagnostics (sample-based): human-feedback 80.0% · benchmarks 45.0% · metrics 65.0% · quality controls 13.3%.

Top Papers

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu · Feb 15, 2026 · Citations: 0

Expert Verification Simulation Env Long Horizon

While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
\$OneMillion-Bench: How Far are Language Agents from Human Experts?
Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen · Mar 9, 2026 · Citations: 0

Rubric Rating Automatic Metrics Tool Use

To this end, we introduce \OneMillion-Bench \OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios.
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu · Feb 27, 2026 · Citations: 0

Red Team Llm As Judge Multi Agent

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols.
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen · Feb 2, 2026 · Citations: 0

Expert Verification Automatic Metrics Web Browsing

However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations.
From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring
Seunghwan Kim, Tiffany H. Kung, Heena Verma, Dilan Edirisinghe, Kaveh Sedehi · Mar 10, 2026 · Citations: 0

Expert Verification Automatic Metrics Long Horizon

Results: Against a human majority-vote standard (N=467), the agent achieved 95.8% emergency sensitivity and 88.5% sensitivity for all actionable alerts (85.7% specificity).
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Human EvalAutomatic Metrics

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0

Red Team Automatic Metrics Long Horizon

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram · Feb 23, 2026 · Citations: 0

Expert Verification Automatic Metrics

Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype…
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen · Mar 19, 2026 · Citations: 0

Demonstrations Simulation Env Multi Agent

To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component.
APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0

Rubric RatingExpert Verification Automatic Metrics Long Horizon

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026 · Citations: 0

Pairwise Preference Simulation Env Long Horizon

Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright · Mar 3, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Llm As JudgeSimulation Env Long Horizon

Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning
Zhaowei Zhang, Xiaohan Liu, Xuekai Zhu, Junchao Huang, Ceyao Zhang · Mar 11, 2026 · Citations: 0

Rubric Rating Llm As Judge

To enable stable RLVR training, we build a rubric-grounded reward pipeline by training a Qwen3-1.7B judge model.
Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Richard J. Young · Mar 20, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Three classifiers (a regex-only detector, a regex-plus-LLM pipeline, and a Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters.
RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
Daniel Yang, Samuel Stante, Florian Redhardt, Lena Libon, Parnian Kassraie · Feb 27, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Reward models are central to aligning large language models (LLMs) with human preferences.
InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem
Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue · Feb 16, 2026 · Citations: 0

Llm As Judge Web Browsing

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation.
RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari, Michael Gentile · Apr 2, 2026 · Citations: 0

Expert Verification Llm As JudgeAutomatic Metrics

This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic…
Learning When to Act: Interval-Aware Reinforcement Learning with Predictive Temporal Structure
Davide Di Gioia · Mar 23, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

Autonomous agents operating in continuous environments must decide not only what to do, but when to act.
Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki · Apr 1, 2026 · Citations: 0

Rubric Rating Automatic Metrics

We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal…
FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data
Mitul Goswami, Romit Chatterjee, Arif Ahmed Sekh · Mar 16, 2026 · Citations: 0

Expert Verification Automatic Metrics

Post-mitigation evaluation on seven clinically distinct cohorts derived from the MIMIC-IV-ED and eICU databases demonstrates substantial bias reduction: Statistical Parity Difference decreases by 40 to 51 percent on MIMIC-IV-ED and 10 to 19…
Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness
Jingyu Lu, Yuhan Wang, Fan Zhuo, Xize Cheng, Changhao Pan · Mar 16, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps.
Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought
Xinghao Zhao · Mar 19, 2026 · Citations: 0

Automatic Metrics Long Horizon

Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive.
ReDAct: Uncertainty-Aware Deferral for LLM Agents
Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov · Apr 8, 2026 · Citations: 0

Simulation Env Long Horizon

Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems.
Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng · Mar 4, 2026 · Citations: 0

Demonstrations Simulation Env

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects…
Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli · Feb 17, 2026 · Citations: 0

Pairwise PreferenceExpert Verification Automatic Metrics

While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.
Do Phone-Use Agents Respect Your Privacy?
Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang · Apr 1, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

We study whether phone-use agents respect privacy while completing benign mobile tasks.
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith · Mar 23, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility.
CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
Hao Wang, Licheng Pan, Zhichao Chen, Chunyuan Zheng, Zhixuan Chu · Mar 19, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly…
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen · Mar 3, 2026 · Citations: 0

Red Team Automatic Metrics Web Browsing

Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs.
TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning
Dilina Rajapakse, Juan C. Rosero, Ivana Dusparic · Mar 23, 2026 · Citations: 0

Pairwise Preference Long Horizon

Multi-Objective Reinforcement Learning (MORL) addresses this limitation by enabling agents to optimize several objectives simultaneously, explicitly reasoning about trade-offs between them.
LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo
Ojas Jain, Dhruv Kumar · Apr 7, 2026 · Citations: 0

Simulation Env Multi Agent

We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning…
Discovering Implicit Large Language Model Alignment Objectives
Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026 · Citations: 0

Rubric Rating Human Eval

To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives.
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai · Mar 30, 2026 · Citations: 0

Critique Edit Long Horizon

We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe.
QED-Nano: Teaching a Tiny Model to Prove Hard Theorems
LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching · Apr 6, 2026 · Citations: 0

Rubric Rating Automatic Metrics

To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.
MemRerank: Preference Memory for Personalized Product Reranking
Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong · Mar 31, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch.
A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic
Peter Brodeur, Jacob M. Koshy, Anil Palepu, Khaled Saab, Ava Homiar · Mar 9, 2026 · Citations: 0

Expert Verification Automatic Metrics

Translating these systems into clinical practice requires assessment in real-world workflows with rigorous safety oversight.
JAWS: Enhancing Long-term Rollout of Neural PDE Solvers via Spatially-Adaptive Jacobian Regularization
Fengxiang Nie, Yasuhiro Suzuki · Mar 4, 2026 · Citations: 0

Automatic MetricsSimulation Env Long Horizon

Experiments demonstrate that JAWS serves as an effective spectral pre-conditioner for trajectory optimization, allowing short-horizon, memory-efficient training to match the accuracy of long-horizon baselines.
Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng · Feb 26, 2026 · Citations: 0

Simulation Env Long Horizon

Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks.
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray · Feb 24, 2026 · Citations: 0

Simulation Env Long Horizon

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
Who can we trust? LLM-as-a-jury for Comparative Assessment
Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill · Feb 18, 2026 · Citations: 0

Pairwise Preference

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment often using pairwise comparative judgements.
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
Shiwan Zhao, Zhihu Wang, Xuyang Zhao, Jiaming Zhou, Caiyue Xu · Apr 9, 2026 · Citations: 0

Pairwise Preference Long Horizon

Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines.
HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning
Zhicong Lu, Zichuan Lin, Wei Jia, Changyuan Tian, Deheng Ye · Mar 19, 2026 · Citations: 0

Pairwise Preference Long Horizon

While large language models excel in diverse domains, their performance on complex longhorizon agentic decision-making tasks remains limited.
RoboPocket: Improve Robot Policies Instantly with Your Phone
Junjie Fang, Wendi Chen, Han Xue, Fangyuan Zhou, Tian Le · Mar 5, 2026 · Citations: 0

Demonstrations Long Horizon

To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones.
TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0

Demonstrations Web Browsing

The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
IROSA: Interactive Robot Skill Adaptation using Natural Language
Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp · Mar 4, 2026 · Citations: 0

Demonstrations Long Horizon

We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and…
Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus
Anna Van Elst, Kerrian Le Caillec, Igor Colin, Stephan Clémençon · Feb 26, 2026 · Citations: 0

Pairwise Preference Multi Agent

The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical…
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026 · Citations: 0

Pairwise Preference Multi Agent

Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and…
S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Jack Young · Apr 1, 2026 · Citations: 0

Automatic Metrics Long Horizon

Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval.
Text-to-Stage: Spatial Layouts from Long-form Narratives
Jefferson Hernandez, Swarnadeep Saha, Chenxi Whitehouse, Sanjeel Parekh, Calvin Murdock · Mar 18, 2026 · Citations: 0

Pairwise Preference Llm As Judge

In this work, we probe the ability of a language model to demonstrate spatial reasoning from unstructured text, mimicking human capabilities and automating a process that benefits many downstream media applications.
VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization
Weixin Liu, Congning Ni, Qingyuan Song, Susannah L. Rose, Christopher Symons · Mar 11, 2026 · Citations: 0

Pairwise Preference Llm As Judge

We introduce VERI-DPO, which uses claim verification to mine preferences and distill them into the summarizer with Direct Preference Optimization (DPO).
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications.
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao · Jan 21, 2026 · Citations: 0

Automatic Metrics Long Horizon

We demonstrate that effective reasoning can be better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead.
Training-Free Dynamic Upcycling of Expert Language Models
Eros Fanì, Oğuzhan Ersoy · Mar 31, 2026 · Citations: 0

Expert Verification

To address these issues, we introduce Dynamic Upcycling MoE (DUME), a novel approach that reuses dense experts trained on different domains to construct a unified MoE model.
ActionParty: Multi-Subject Action Binding in Generative Video Games
Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov · Apr 2, 2026 · Citations: 0

Automatic MetricsSimulation Env Multi Agent

However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene.
IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs
Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin · Mar 11, 2026 · Citations: 0

Red Team Automatic Metrics

IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections.
Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models
Punyajoy Saha, Sudipta Halder, Debjyoti Mondal, Subhadarshi Panda · Mar 7, 2026 · Citations: 0

Pairwise PreferenceRed Team Automatic Metrics

Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale,…
MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani · Mar 3, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs.
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu · Feb 21, 2026 · Citations: 0

Red Team Automatic Metrics

We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold.

Related Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now