HFEPX Hub

CS.LG + General Papers

Updated from current HFEPX corpus (Apr 9, 2026). 123 papers are grouped in this hub page.

Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: ALFWorld. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Apr 8, 2026.

Papers: 123 · Last published: Apr 8, 2026

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: High.

Analysis blocks below are computed from the currently loaded sample (60 of 123 total papers in this hub).

High-Signal Coverage: 100.0%. 60 / 60 sampled papers are not low-signal flagged.
Replication-Ready Set: 12 papers are replication-ready (benchmark + metric + explicit evaluation mode).
Judge/Human Comparability: 0 papers contain both `human_eval` and `llm_as_judge`, so none support judge-vs-human agreement analysis.
4 papers report explicit quality controls (calibration/adjudication/IAA).
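
The two triage filters above (replication-ready, and judge/human comparability) can be sketched as simple predicates over per-paper records. This is a minimal sketch: the `Paper` record and its field names are hypothetical and are not the hub's actual schema. The three sample rows are taken from the protocol matrix on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical record; field names are illustrative, not the hub's schema.
    title: str
    benchmarks: list = field(default_factory=list)
    metrics: list = field(default_factory=list)
    eval_modes: list = field(default_factory=list)


def is_replication_ready(paper: Paper) -> bool:
    """Benchmark + metric + evaluation mode must all be explicitly present."""
    return bool(paper.benchmarks) and bool(paper.metrics) and bool(paper.eval_modes)


def supports_judge_human_comparison(paper: Paper) -> bool:
    """The paper must report both human evaluation and an LLM judge."""
    return {"human_eval", "llm_as_judge"} <= set(paper.eval_modes)


# Rows copied from the protocol matrix; "Not Reported" is modeled as an empty list.
papers = [
    Paper("TraceSafe", ["Tracesafe Bench"], ["Accuracy"], ["automatic_metrics"]),
    Paper("ReDAct", ["ALFWorld"], ["Cost", "Token cost"], ["simulation_env"]),
    Paper("Kernel-Smith", ["Kernelbench"], [], []),
]

ready = [p.title for p in papers if is_replication_ready(p)]
comparable = [p.title for p in papers if supports_judge_human_comparison(p)]
print(ready, comparable)
```

Treating "Not Reported" as an empty list keeps both checks to a truthiness test, which is one reasonable design choice among several.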

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Why This Matters For Eval Research

72.4% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 41.5% of papers in this hub.
ALFWorld is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

The most common quality-control signal is rater calibration (1.6% of papers).
Raters are mostly domain experts and annotation is commonly pairwise; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift; the triage above finds 0 papers containing both, so for now this comparison has to be made across papers.
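
One common way to quantify judge-human agreement is Cohen's kappa, which also appears as a reported metric in this hub. The sketch below is an assumption about how such a comparison might be run, not a method taken from any listed paper; the label lists are made-up placeholders standing in for pairwise verdicts ("A" or "B" preferred).

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Cohen's kappa for two raters labeling the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement from each rater's marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Placeholder verdicts, illustrative only.
human_labels = ["A", "A", "B", "B", "A", "B"]
judge_labels = ["A", "B", "B", "B", "A", "A"]
kappa = cohen_kappa(human_labels, judge_labels)
print(round(kappa, 3))
```

Tracking this value per benchmark or per judge model over time is one way to make "drift" concrete.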

Benchmark Interpretation

ALFWorld appears in 2.4% of hub papers (3/123); use this cohort for benchmark-matched comparisons.
DROP appears in 1.6% of hub papers (2/123); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 17.9% of hub papers (22/123); compare with a secondary metric before ranking methods.
Cost is reported in 8.1% of hub papers (10/123); compare with a secondary metric before ranking methods.
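
The benchmark and metric shares above use the full hub (123 papers) as the denominator, not the 60-paper sample. A minimal sketch of that arithmetic, with a hypothetical helper name `hub_share`:

```python
HUB_SIZE = 123  # total papers in this hub, as stated on the page


def hub_share(count, total=HUB_SIZE):
    """Percentage of hub papers, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Counts as listed in the benchmark and metric interpretation sections.
counts = {"ALFWorld": 3, "DROP": 2, "accuracy": 22, "cost": 10}
shares = {name: hub_share(count) for name, count in counts.items()}
print(shares)
```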

Researcher Checklist

Strong: Papers with explicit human feedback. Coverage is strong (72.4% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (3.3% vs 30% target).
Gap: Papers naming benchmarks/datasets. Coverage is a replication risk (17.9% vs 35% target).
Strong: Papers naming evaluation metrics. Coverage is strong (38.2% vs 35% target).
Gap: Papers with known rater population. Coverage is a replication risk (8.1% vs 35% target).
Moderate: Papers with known annotation unit. Coverage is usable but incomplete (29.3% vs 35% target).
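
The checklist labels compare each coverage figure against a target. The page does not state where the Moderate band begins, so the sketch below assumes a hypothetical cutoff of 80% of the target (`moderate_ratio`). That cutoff happens to reproduce the six labels shown here, but it is an assumption and not the hub's documented rule.

```python
def coverage_status(coverage_pct, target_pct, moderate_ratio=0.8):
    """Label coverage against a target; moderate_ratio is an assumed cutoff."""
    if coverage_pct >= target_pct:
        return "Strong"
    if coverage_pct >= moderate_ratio * target_pct:
        return "Moderate"
    return "Gap"


# (coverage %, target %) pairs copied from the checklist.
checklist = {
    "explicit human feedback": (72.4, 45),
    "quality controls": (3.3, 30),
    "benchmarks/datasets": (17.9, 35),
    "evaluation metrics": (38.2, 35),
    "known rater population": (8.1, 35),
    "known annotation unit": (29.3, 35),
}

statuses = {name: coverage_status(c, t) for name, (c, t) in checklist.items()}
for name, status in statuses.items():
    print(f"{status}: {name}")
```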

Strengths

Strong human-feedback signal (72.4% of papers).
Contains both human-eval and LLM-as-judge protocols, though the triage above finds no single paper reporting both, so head-to-head methodology comparison is cross-paper.
Agentic evaluation appears in 34.1% of papers.

Known Gaps

Only 3.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.1% coverage).
Benchmark coverage is thin (17.9% of papers mention benchmarks/datasets).

Suggested Next Analyses

Stratify by benchmark (ALFWorld vs DROP) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
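
The stratification and dual-metric suggestions above can be combined in a small sketch: group results by benchmark first, then rank methods inside each group while keeping both accuracy and cost visible. The method names and all numbers below are placeholders for illustration and are not results from any paper in this hub.

```python
from collections import defaultdict

# Placeholder results; values are illustrative only.
results = [
    {"method": "method_a", "benchmark": "ALFWorld", "accuracy": 0.80, "cost": 3.0},
    {"method": "method_b", "benchmark": "ALFWorld", "accuracy": 0.80, "cost": 1.5},
    {"method": "method_a", "benchmark": "DROP", "accuracy": 0.65, "cost": 1.0},
    {"method": "method_b", "benchmark": "DROP", "accuracy": 0.60, "cost": 0.5},
]


def stratify_by_benchmark(rows):
    """Group result rows so methods are only compared within one benchmark."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["benchmark"]].append(row)
    return dict(groups)


def rank_within(rows):
    """Sort by accuracy (higher first), breaking ties by cost (lower first)."""
    return sorted(rows, key=lambda r: (-r["accuracy"], r["cost"]))


leaderboard = {
    bench: [r["method"] for r in rank_within(rows)]
    for bench, rows in stratify_by_benchmark(results).items()
}
print(leaderboard)
```

Using cost only as a tie-breaker is one possible choice; a weighted score or a Pareto front would be equally defensible depending on the study.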

Recommended Queries

Judge vs Human Agreement
Benchmark Slice: ALFWorld
Metric Slice: accuracy
IAA-Reported Evaluations
Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

Personalized RewardBench: Evaluating Reward Models with Human Aligned…

Highest protocol score with explicit human/eval signal plus Rewardbench.

Strongest benchmark reference

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step To…

Tracesafe-Bench with accuracy gives a fast comparison anchor.

Strongest recent paper

DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment

Useful for current practice scanning; published Mar 23, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Apr 8, 2026 · Citations: 0 · Score: 8.0
HF: Pairwise Preference, Rubric Rating · Eval: Human Eval, Automatic Metrics · Benchmark: Rewardbench · Metric: Accuracy

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Apr 8, 2026 · Citations: 0 · Score: 8.0
HF: Red Team · Eval: Automatic Metrics · Benchmark: Tracesafe Bench · Metric: Accuracy

DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
Mar 23, 2026 · Citations: 0 · Score: 8.0
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: MT Bench · Metric: Accuracy

Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Mar 20, 2026 · Citations: 0 · Score: 7.5
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Kappa

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Mar 19, 2026 · Citations: 0 · Score: 6.5
HF: Demonstrations · Eval: Simulation Env · Benchmark: Mapg Bench · Metric: Not Reported

How Reliable is Language Model Micro-Benchmarking?
Oct 9, 2025 · Citations: 0 · Score: 6.5
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: MMLU · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	Published	HF Signal	Eval Modes	Benchmarks	Metrics	QC
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization	Apr 8, 2026	Yes (Pairwise Preference, Rubric Rating)	Human Eval, Automatic Metrics	Rewardbench	Accuracy, Helpfulness	Not Reported
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories	Apr 8, 2026	Yes (Red Team)	Automatic Metrics	Tracesafe Bench	Accuracy	Not Reported
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment	Mar 23, 2026	Yes (Pairwise Preference)	Automatic Metrics	MT Bench, AlpacaEval	Accuracy	Not Reported
Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation	Mar 20, 2026	Yes (Pairwise Preference)	Automatic Metrics	Not Reported	Kappa, Faithfulness	Inter Annotator Agreement Reported
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation	Mar 19, 2026	Yes (Demonstrations)	Simulation Env	Mapg Bench	Not Reported	Not Reported
How Reliable is Language Model Micro-Benchmarking?	Oct 9, 2025	Yes (Pairwise Preference)	Automatic Metrics	MMLU, MMLU Pro	Accuracy, Cost	Not Reported
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization	Mar 30, 2026	Yes (Critique Edit)	Not Reported	Kernelbench	Not Reported	Not Reported
Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training	Mar 12, 2026	Yes (Pairwise Preference)	Not Reported	LMSYS Chatbot Arena, Arena Hard	Not Reported	Not Reported
Learning When to Act: Interval-Aware Reinforcement Learning with Predictive Temporal Structure	Mar 23, 2026	Yes (Pairwise Preference)	Automatic Metrics	Not Reported	Kappa	Not Reported
ReDAct: Uncertainty-Aware Deferral for LLM Agents	Apr 8, 2026	No (Not Reported)	Simulation Env	ALFWorld	Cost, Token cost	Not Reported
Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks	Mar 4, 2026	Yes (Demonstrations)	Simulation Env	MiniWoB++	Not Reported	Not Reported
IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning	Sep 26, 2025	Yes (Demonstrations)	Automatic Metrics	Not Reported	Accuracy, Cost	Calibration

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	Personalized RewardBench: Evaluating Reward Models…	TraceSafe: A Systematic Assessment of LLM Guardrail…	DSPA: Dynamic SAE Steering for Data-Efficient Prefe…
Human Feedback	Pairwise Preference, Rubric Rating	Red Team	Pairwise Preference
Evaluation Modes	Human Eval, Automatic Metrics	Automatic Metrics	Automatic Metrics
Benchmarks	Rewardbench	Tracesafe Bench	MT Bench, AlpacaEval
Metrics	Accuracy, Helpfulness	Accuracy	Accuracy
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Pairwise	Trajectory	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (44)
Demonstrations (19)
Red Team (18)
Critique Edit (5)

Evaluation Modes

Automatic Metrics (51)
Simulation Env (18)
LLM As Judge (7)
Human Eval (4)

Top Benchmarks

ALFWorld (3)
DROP (2)
LMSYS Chatbot Arena (2)
WebShop (2)

Top Metrics

Accuracy (22)
Cost (10)
Helpfulness (5)
Success rate (5)

Rater Population Mix

Domain Experts (9)
Mixed (1)

Quality Controls

Calibration (2)
Adjudication (1)
Inter Annotator Agreement Reported (1)

Coverage diagnostics (sample-based): human-feedback 78.3% · benchmarks 36.7% · metrics 48.3% · quality controls 6.7%.

Top Papers

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0

Pairwise Preference, Rubric Rating · Human Eval, Automatic Metrics

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play
Zelai Xu, Ruize Zhang, Chao Yu, Huining Yuan, Xiangmin Yi · Feb 4, 2025 · Citations: 0

Demonstrations · Automatic Metrics, Simulation Env · Multi Agent

We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi-agent reinforcement…
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0

Red Team Automatic Metrics Long Horizon

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
Measuring AI Ability to Complete Long Software Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia · Mar 18, 2025 · Citations: 0

Expert Verification Automatic Metrics Tool Use

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen · Mar 19, 2026 · Citations: 0

Demonstrations Simulation Env Multi Agent

To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component.
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026 · Citations: 0

Pairwise Preference Simulation Env Long Horizon

Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright · Mar 3, 2026 · Citations: 0

Pairwise Preference, Rubric Rating · LLM As Judge, Simulation Env · Long Horizon

Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
RAPTOR: A Foundation Policy for Quadrotor Control
Jonas Eschmann, Dario Albani, Giuseppe Loianno · Sep 15, 2025 · Citations: 0

Demonstrations Simulation Env Long Horizon

Humans are remarkably data-efficient when adapting to new unseen conditions, like driving a new car.
Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Richard J. Young · Mar 20, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Three classifiers (a regex-only detector, a regex-plus-LLM pipeline, and a Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters.
InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem
Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue · Feb 16, 2026 · Citations: 0

LLM As Judge Web Browsing

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation.
Learning When to Act: Interval-Aware Reinforcement Learning with Predictive Temporal Structure
Davide Di Gioia · Mar 23, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

Autonomous agents operating in continuous environments must decide not only what to do, but when to act.
How Reliable is Language Model Micro-Benchmarking?
Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta · Oct 9, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark.
ReDAct: Uncertainty-Aware Deferral for LLM Agents
Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov · Apr 8, 2026 · Citations: 0

Simulation Env Long Horizon

Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems.
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025 · Citations: 0

Demonstrations Simulation Env Long Horizon

Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
SPACeR: Self-Play Anchoring with Centralized Reference Models
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka · Oct 20, 2025 · Citations: 0

Demonstrations Simulation Env Multi Agent

Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.
Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng · Mar 4, 2026 · Citations: 0

Demonstrations Simulation Env

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects…
Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0

Demonstrations Simulation Env

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning
Aayush Mishra, Daniel Khashabi, Anqi Liu · Sep 26, 2025 · Citations: 0

Demonstrations Automatic Metrics

Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and two model families.
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith · Mar 23, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility.
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen · Mar 3, 2026 · Citations: 0

Red Team Automatic Metrics Web Browsing

Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs.
TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning
Dilina Rajapakse, Juan C. Rosero, Ivana Dusparic · Mar 23, 2026 · Citations: 0

Pairwise Preference Long Horizon

Multi-Objective Reinforcement Learning (MORL) addresses this limitation by enabling agents to optimize several objectives simultaneously, explicitly reasoning about trade-offs between them.
Discovering Implicit Large Language Model Alignment Objectives
Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026 · Citations: 0

Rubric Rating Human Eval

To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives.
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai · Mar 30, 2026 · Citations: 0

Critique Edit Long Horizon

We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe.
MemRerank: Preference Memory for Personalized Product Reranking
Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong · Mar 31, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch.
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray · Feb 24, 2026 · Citations: 0

Simulation Env Long Horizon

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
Who can we trust? LLM-as-a-jury for Comparative Assessment
Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill · Feb 18, 2026 · Citations: 0

Pairwise Preference

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements.
HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning
Zhicong Lu, Zichuan Lin, Wei Jia, Changyuan Tian, Deheng Ye · Mar 19, 2026 · Citations: 0

Pairwise Preference Long Horizon

While large language models excel in diverse domains, their performance on complex long-horizon agentic decision-making tasks remains limited.
TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0

Demonstrations Web Browsing

The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
IROSA: Interactive Robot Skill Adaptation using Natural Language
Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp · Mar 4, 2026 · Citations: 0

Demonstrations Long Horizon

We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and…
Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus
Anna Van Elst, Kerrian Le Caillec, Igor Colin, Stephan Clémençon · Feb 26, 2026 · Citations: 0

Pairwise Preference Multi Agent

The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical…
Text-to-Stage: Spatial Layouts from Long-form Narratives
Jefferson Hernandez, Swarnadeep Saha, Chenxi Whitehouse, Sanjeel Parekh, Calvin Murdock · Mar 18, 2026 · Citations: 0

Pairwise Preference LLM As Judge

In this work, we probe the ability of a language model to demonstrate spatial reasoning from unstructured text, mimicking human capabilities and automating a process that benefits many downstream media applications.
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications.
Generating Fine Details of Entity Interactions
Xinyi Gu, Jiayuan Mao · Apr 11, 2025 · Citations: 0

Critique Edit Human Eval

However, images should also encapsulate rich interactions between objects, where existing models often fall short, likely due to limited training data and benchmarks for rare interactions.
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang · Feb 3, 2025 · Citations: 0

Pairwise Preference LLM As Judge

Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development.
Evaluation of Large Language Models via Coupled Token Generation
Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco · Feb 3, 2025 · Citations: 0

Pairwise Preference

In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning.
ActionParty: Multi-Subject Action Binding in Generative Video Games
Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov · Apr 2, 2026 · Citations: 0

Automatic Metrics, Simulation Env · Multi Agent

However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene.
IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs
Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin · Mar 11, 2026 · Citations: 0

Red Team Automatic Metrics

IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections.
Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models
Punyajoy Saha, Sudipta Halder, Debjyoti Mondal, Subhadarshi Panda · Mar 7, 2026 · Citations: 0

Pairwise Preference, Red Team · Automatic Metrics

Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale,…
MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani · Mar 3, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs.
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu · Feb 21, 2026 · Citations: 0

Red Team Automatic Metrics

We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold.
When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi · Jun 9, 2025 · Citations: 0

Red Team Automatic Metrics

In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment.
Maximizing Asynchronicity in Event-based Neural Networks
Haiqing Hao, Nikola Zubić, Weihua He, Zhipeng Sui, Davide Scaramuzza · May 16, 2025 · Citations: 0

Demonstrations Automatic Metrics

Event cameras deliver visual data with high temporal resolution, low latency, and minimal redundancy, yet their asynchronous, sparse sequential nature challenges standard tensor-based machine learning (ML).
Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen · Feb 24, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval.
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit · Apr 7, 2026 · Citations: 0

LLM As Judge, Automatic Metrics

We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence embedding similarity, and an LLM as a judge verdict to label generated responses as grounded or hallucinated without…
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding · Feb 25, 2026 · Citations: 0

Simulation Env Long Horizon

We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.
SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
Yifei Xu, Guilherme Potje, Shivam Shandilya, Tiancheng Yuan, Leonardo de Oliveira Nunes · Feb 24, 2026 · Citations: 0

Rubric Rating, Red Team

We present SibylSense, an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items.
Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026 · Citations: 0

Pairwise Preference Long Horizon

When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
Investigation for Relative Voice Impression Estimation
Kenichi Fujita, Yusuke Ijima · Feb 15, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., "Dark–Bright").
Learning to Answer from Correct Demonstrations
Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma · Oct 17, 2025 · Citations: 0

Demonstrations Automatic Metrics

We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time.
Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong · Sep 25, 2025 · Citations: 0

Rubric Rating Automatic Metrics

Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs.
Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards
Cheng Jiayang, Xin Liu, Zhihan Zhang, Haoyang Wen, Zixuan Zhang · Mar 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

We present a framework addressing both challenges.
RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration
Srikumar Nayak · Feb 26, 2026 · Citations: 0

Automatic Metrics Multi Agent

This paper proposes RLShield, a practical multi-agent RL pipeline for financial cyber defense.
LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning
Xinwu Ye, Yicheng Mao, Jia Zhang, Yimeng Liu, Li Hao · Feb 6, 2026 · Citations: 0

Automatic Metrics Long Horizon

Across diverse chemical reasoning benchmarks, LatentChem achieves a 59.88% non-tie win rate over strong CoT-based baselines on ChemCoTBench, while delivering a 10.84× average reduction in reasoning overhead.
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar · Feb 3, 2026 · Citations: 0

Automatic Metrics Web Browsing

To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts.
Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang · Mar 12, 2026 · Citations: 0

Pairwise Preference

Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked.
SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests
Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj, Rada Mihalcea, Zhijing Jin · Oct 6, 2025 · Citations: 0

Critique Edit

Yet, existing safety benchmarks rarely test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control.
Less is More: Improving LLM Alignment via Preference Data Selection
Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang · Feb 20, 2025 · Citations: 0

Pairwise Preference

Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences.
Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving
Yinan Zheng, Tianyi Tan, Bin Huang, Enguang Liu, Ruiming Liang · Feb 26, 2026 · Citations: 0

Simulation Env Long Horizon

However, their applications and evaluations in autonomous driving remain limited to simulation-based or laboratory settings.
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Hanjiang Hu, Alexander Robey, Changliu Liu · Feb 28, 2025 · Citations: 0

Red Team

To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues.
LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications
Mayank Mayank, Bharanidhar Duraisamy, Florian Geiss · Apr 2, 2026 · Citations: 0

Automatic Metrics Long Horizon

Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset…
