OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 304
Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs

Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktäschel, Jakob Foerster · Jun 23, 2025

Citations: 0
Demonstrations Automatic Metrics Coding
  • Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily 'programmed' into LLMs through PBB, with important implications…
DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries

Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto · Jun 20, 2025

Citations: 0
Expert Verification Llm As Judge Medicine
  • This study introduces DistillNote, an evaluation framework for LLM summaries that targets their functional utility by applying the generated summary downstream in a complex clinical prediction task, explicitly quantifying how much prediction…
  • We contrasted DistillNote's results with evaluations from LLM-as-judge and clinicians, assessing consistency across different evaluation methods.
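The functional-utility idea above can be sketched in a few lines: run the same downstream predictor on full notes and on their summaries, then report how much predictive performance the summary retains. Everything here (the toy `predict_risk` model, the example notes, the accuracy metric) is a hypothetical stand-in, not the paper's pipeline.

```python
# Hedged sketch of DistillNote-style functional evaluation: score a
# downstream clinical classifier on full notes vs. LLM summaries and
# report the fraction of predictive performance the summary retains.
# `predict_risk` and the example data are toy stand-ins.

def predict_risk(text: str) -> float:
    """Toy stand-in for a downstream clinical prediction model."""
    return min(1.0, text.count("abnormal") / 3)

def accuracy(texts, labels, threshold=0.5):
    preds = [predict_risk(t) >= threshold for t in texts]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

full_notes = ["abnormal abnormal abnormal ecg", "routine visit, no findings"]
summaries  = ["abnormal abnormal abnormal",     "routine, unremarkable"]
labels     = [True, False]

acc_full    = accuracy(full_notes, labels)
acc_summary = accuracy(summaries, labels)
print(f"utility retained: {acc_summary / acc_full:.0%}")
```

The ratio (rather than raw accuracy) is the interesting quantity: it isolates what the summarization step costs the downstream task.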
From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise

Nitin Sharma, Thomas Wolfers, Çağatay Yıldız · Jun 9, 2025

Citations: 0
Expert Verification Automatic Metrics Law
  • Accurate domain-specific benchmarking of LLMs is essential, particularly in domains with direct implications for humans, such as law, healthcare, and education.
  • However, existing benchmarks are documented to be contaminated and are based on multiple choice questions, which suffer from inherent biases.
When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment

Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi · Jun 9, 2025

Citations: 0
Red Team Automatic Metrics General
  • In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment.
  • We first define ASR inflation as the increase in ASR due to style patterns in existing jailbreak benchmark queries.
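The definition in the bullet above reduces to a simple difference of two rates. The sketch below computes it under stated assumptions: `is_harmful` is a trivial stand-in judge, and the response lists are invented examples, not the paper's benchmark data.

```python
# Hedged sketch of "ASR inflation": the increase in attack success
# rate (ASR) when style patterns are added to jailbreak queries.
# The judge and the example responses are toy stand-ins.

def attack_success_rate(responses, is_harmful) -> float:
    """Fraction of model responses judged harmful."""
    return sum(map(is_harmful, responses)) / len(responses)

def asr_inflation(plain_responses, styled_responses, is_harmful) -> float:
    """ASR(styled queries) - ASR(plain queries)."""
    return (attack_success_rate(styled_responses, is_harmful)
            - attack_success_rate(plain_responses, is_harmful))

# Toy judge: flags any response that does not open with a refusal.
is_harmful = lambda r: not r.lower().startswith("i can't")
plain  = ["I can't help with that.", "I can't assist.", "Here is how..."]
styled = ["Here is how...", "Sure, step one...", "I can't assist."]
print(asr_inflation(plain, styled, is_harmful))
```

A positive value means the styled variants of the same queries succeed more often, which is exactly the vulnerability the paper attributes to superficial style alignment.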
HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang · Jun 4, 2025

Citations: 0
Expert Verification Automatic Metrics Multilingual
  • However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS).
  • Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations.
EuroGEST: Investigating gender stereotypes in multilingual language models

Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch · Jun 4, 2025

Citations: 0
Human Eval Automatic Metrics Coding Multilingual
  • Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric.
  • EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics.
Synthesis of discrete-continuous quantum circuits with multimodal diffusion models

Florian Fürrutter, Zohim Chandani, Ikko Hamamura, Hans J. Briegel, Gorka Muñoz-Gil · Jun 2, 2025

Citations: 0
Automatic Metrics Simulation Env General
  • We benchmark the model over different experiments, analyzing the method's accuracy across varying qubit counts and circuit depths, and showcasing its ability to outperform existing approaches in gate counts and under noisy conditions.
Pairwise Preference Automatic Metrics General
  • Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities.
  • Strikingly, the overlap between spoken and written syntactic inventories is very limited: most structures attested in speech do not occur in writing, pointing to modality-specific preferences in syntactic organization that reflect the distinct…
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments

Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier, Yu Su · May 28, 2025

Citations: 0
Red Team Simulation Env Web Browsing General
  • Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection.
  • Current evaluations of this threat either lack support for realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces.
Automatic Metrics Simulation Env Coding
  • Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language.
  • In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning across…
Refusal Direction is Universal Across Safety-Aligned Languages

Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank · May 22, 2025

Citations: 0
Red Team Automatic Metrics Multilingual
  • Refusal mechanisms in large language models (LLMs) are essential for ensuring safety.
  • In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages.
Citations: 0
Pairwise Preference Simulation Env General
  • Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear.
  • We evaluated 34 distinct models across 45 configurations from major closed-source and open-source providers, and compared outputs to responses from 106 human participants.
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai, Yang Liu · May 21, 2025

Citations: 0
Pairwise Preference Automatic Metrics General
  • However, existing reward benchmarks focus on preference comparisons between responses rather than evaluating verification against ground truth references, leaving a critical gap in our ability to evaluate verification systems used in reason
  • In this paper, we introduce VerifyBench and its challenging variant VerifyBench-Hard, two benchmarks specifically designed to assess reference-based reward systems.
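The distinction the paper draws (verification against a ground-truth reference, rather than a pairwise preference between responses) can be illustrated with a minimal reference-based reward. The answer-extraction heuristic below is a simplistic assumption for illustration, not VerifyBench's own verifier.

```python
# Minimal sketch of a reference-based reward of the kind VerifyBench
# evaluates: score a single response against a ground-truth reference
# answer. The "last number in the response" extraction rule is a
# simplifying assumption, not the benchmark's verifier.

import re

def extract_final_answer(response: str) -> str:
    """Assume the final number in the response is its answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else ""

def reference_reward(response: str, reference: str) -> float:
    """1.0 if the extracted answer matches the reference, else 0.0."""
    return 1.0 if extract_final_answer(response) == reference.strip() else 0.0

print(reference_reward("2 + 2 ... so the answer is 4", "4"))  # 1.0
print(reference_reward("I believe it's 5", "4"))              # 0.0
```

Note how this differs from a preference reward model: there is no second response to compare against, only a reference, which is exactly the setting the benchmark targets.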
Visual Planning: Let's Think Only with Images

Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen · May 16, 2025

Citations: 0
Automatic Metrics Web Browsing Coding
  • In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions.
Pairwise Preference Automatic Metrics Medicine Coding
  • To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from multimodal biomedical data through preference optimization.
  • While MINT supports different optimization techniques, we primarily implement it with the Odds Ratio Preference Optimization (ORPO) framework as its backbone.
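The odds-ratio term at the heart of ORPO, the backbone mentioned above, penalizes the model unless the chosen response has higher odds than the rejected one. The sketch below computes that term from two scalar log-probabilities; the inputs are made-up numbers, and the full ORPO objective additionally combines this term with a standard SFT loss.

```python
# Hedged sketch of the ORPO (Odds Ratio Preference Optimization)
# odds-ratio loss term: -log sigmoid(log odds ratio of the chosen
# response over the rejected one). The log-probs are invented scalars
# standing in for (length-normalized) sequence log-probabilities.

import math

def odds(logp: float) -> float:
    """odds(y|x) = p / (1 - p) for a sequence probability p = exp(logp)."""
    p = math.exp(logp)
    return p / (1.0 - p)

def orpo_or_loss(logp_chosen: float, logp_rejected: float) -> float:
    """-log sigmoid(log(odds(chosen) / odds(rejected)))."""
    log_or = math.log(odds(logp_chosen) / odds(logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))

# Preferring the chosen response more strongly shrinks the loss term.
print(orpo_or_loss(-0.5, -2.0) < orpo_or_loss(-1.5, -2.0))  # True
```

In the paper's full objective this term is weighted and added to the supervised fine-tuning loss, so preference alignment and imitation happen in one pass rather than as separate reward-model and RL stages.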