OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 84 Search mode: keyword RSS

Filter by tag

All Automatic Metrics (966) General (585) Coding (310) Simulation Env (115) Math (102) Multilingual (97) Long Horizon (81) Medicine (78) Pairwise Preference (70) Law (45) Multi Agent (41) Human Eval (38) Expert Verification (25) Web Browsing (22) Critique Edit (21) Red Team (21)

World Simulation with Video Foundation Models for Physical AI

NVIDIA, :, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji · Oct 28, 2025

Citations: 0

Simulation Env Long Horizon CodingMultilingual

These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nv

Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language

Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab · Oct 27, 2025

Citations: 0

Human EvalAutomatic Metrics Coding

We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural n
Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English.

Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang · Oct 27, 2025

Citations: 0

Pairwise Preference Human Eval Coding

Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation.
However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation.

Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing

Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu, Lingkai Kong · Oct 14, 2025

Citations: 0

Pairwise Preference Automatic Metrics Coding

Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference.

Mapping Semantic & Syntactic Relationships with Geometric Rotation

Michael Freenor, Lauren Alvarez · Oct 10, 2025

Citations: 0

Demonstrations Automatic Metrics CodingMultilingual

PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity

Zixin Song, Bowen Zhang, Qian-Wen Zhang, Di Yin, Xing Sun, Chunping Li · Oct 5, 2025

Citations: 0

Pairwise Preference Automatic Metrics Coding

On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture.

Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan · Sep 28, 2025

Citations: 0

Pairwise Preference Automatic Metrics Coding

These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning.
We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined.

RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility

Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang · Sep 27, 2025

Citations: 0

Automatic Metrics Long Horizon Coding

Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors.
To address this, we introduce RHYTHM (Reasoning with Hierarchical Temporal Tokenization for Human Mobility), a unified framework that leverages large language models (LLMs) as general-purpose spatio-temporal predictors and trajectory reason

CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures

Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025

Citations: 0

Pairwise Preference Automatic Metrics Multi Agent LawCoding

Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.
In this paper, we present the Conversational Robustness Evaluation Score: CORE, a metric to quantify the effectiveness of language use within multi-agent systems across different game-theoretic interactions.

LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

Guozhao Mo, Wenliang Zhong, Jiawei Chen, Qianhao Yuan, Xuanang Chen, Yaojie Lu · Aug 3, 2025

Citations: 0

Automatic Metrics Tool Use MedicineCoding

Unfortunately, there is still a large gap between real-world MCP usage and current evaluation: they typically assume single-server settings and directly inject tools into the model's context, bypassing the challenges of large-scale retrieva
The benchmark includes a ready-to-deploy tool suite of 70 servers with 527 tools, ensuring reproducibility without scattered API configuration.

GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning

Jie Peng, Jiarui Ji, Runlin Lei, Zhewei Wei, Yongchao Liu, Chuntao Hong · Jul 4, 2025

Citations: 0

Automatic Metrics Multi Agent Coding

Additionally, prior work mainly focuses on discriminative tasks on DyTAGs, resulting in a lack of standardized task formulations and evaluation protocols tailored for DyTAG generation.
To address these critical issues, we propose Generative DyTAG Benchmark (GDGB), which comprises eight meticulously curated DyTAG datasets with high-quality textual features for both nodes and edges, overcoming limitations of prior datasets.

Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs

Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel, Jakob Foerster · Jun 23, 2025

Citations: 0

Demonstrations Automatic Metrics Coding

Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily `programmed' into LLMs through PBB, with important implica

Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models

Maximilian Kreutner, Marlene Lutz, Markus Strohmaier · Jun 13, 2025

Citations: 0

Automatic MetricsSimulation Env Coding

"Don't Do That!": Guiding Embodied Systems through Large Language Model-based Constraint Generation

Amin Seffo, Aladin Djuhera, Masataro Asai, Holger Boche · Jun 4, 2025

Citations: 0

Simulation Env Web Browsing MathCoding

EuroGEST: Investigating gender stereotypes in multilingual language models

Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch · Jun 4, 2025

Citations: 0

Human EvalAutomatic Metrics CodingMultilingual

Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric.
EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics.

iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering

Shuai Wang, Yinan Yu · Jun 2, 2025

Citations: 0

Automatic Metrics Long Horizon Coding

Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.

HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning

Chuhao Zhou, Jianfei Yang · May 23, 2025

Citations: 0

Automatic MetricsSimulation Env Coding

Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language.
In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning acros

Visual Planning: Let's Think Only with Images

Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen · May 16, 2025

Citations: 0

Automatic Metrics Web Browsing Coding

In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions.

Multimodal Integrated Knowledge Transfer to Large Language Models through Preference Optimization with Biomedical Applications

Zhanliang Wang, Da Wu, Quan Nguyen, Zhuoran Xu, Kai Wang · May 9, 2025

Citations: 0

Pairwise Preference Automatic Metrics MedicineCoding

To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from multimodal biomedical data through preference optimiz
While MINT supports different optimization techniques, we primarily implement it with the Odds Ratio Preference Optimization (ORPO) framework as its backbone.

Reshaping MOFs text mining with a dynamic multi-agents framework of large language model

Zuhong Lin, Daoyuan Ren, Kai Ran, Jing Sun, Songlin Yu, Xuefeng Bai · Apr 26, 2025

Citations: 0

Automatic Metrics Multi Agent Coding

Protocol Hubs

Expert Verification Papers (23) CS.CL + Pairwise Preference Papers (56) Pairwise Preference Papers (64) CS.AI + Pairwise Preference Papers (39) General + Pairwise Preference Papers (38) CS.CL + Expert Verification Papers (18) Automatic Metrics + Pairwise Preference Papers (51) Expert Verification Or Rubric Rating Papers (36) CS.CL + Medicine Papers (52) Automatic Metrics + Expert Verification Papers (19) Human Eval Papers (36) CS.CL + Math Papers (71) CS.CL + Human Eval Papers (33) Long Horizon Papers (74) Critique Edit Or Expert Verification Papers (41) Automatic Metrics + General + Pairwise Preference Papers (29)

Human Feedback and Eval Paper Explorer

Filter by tag

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives