Subrit Dikshit · Feb 18, 2026
OpenTrain Research Tools
Human Feedback and Eval Paper Explorer
A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.
Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill · Feb 18, 2026
- Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements.
- Existing approaches typically rely on single judges, or aggregate multiple judges under an assumption of equal reliability.
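The gap this entry points at can be illustrated with a toy sketch (not the paper's method): aggregating pairwise judgements from several judges using per-judge reliability weights instead of an equal-weight majority vote. The judge names, weights, and votes below are hypothetical.

```python
from collections import defaultdict

def weighted_pairwise_winner(votes, weights):
    """Aggregate pairwise judgements (judge -> preferred candidate),
    weighting each judge by an estimated reliability score."""
    scores = defaultdict(float)
    for judge, preferred in votes.items():
        scores[preferred] += weights.get(judge, 1.0)  # missing weight = equal weight
    return max(scores, key=scores.get)

# Hypothetical judges comparing candidate outputs "A" vs "B".
votes = {"judge_1": "A", "judge_2": "B", "judge_3": "B"}

# Equal reliability: a simple majority picks "B".
print(weighted_pairwise_winner(votes, {}))  # B
# Unequal reliability: one trusted judge can outweigh two unreliable ones.
print(weighted_pairwise_winner(
    votes, {"judge_1": 0.9, "judge_2": 0.3, "judge_3": 0.3}))  # A
```

The point of the sketch is only that the aggregation outcome flips once judges are no longer assumed equally reliable, which is the assumption the paper questions.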
Vered Tohar, Tsahi Hayat, Amir Leshem · Feb 18, 2026
- In a blinded authorship test with 50 humanities students and graduates (each judging three AI poems and three poems by well-known poets), judgments were at chance: human poems were labeled human 54% of the time and AI poems 52%, with 95% confidence intervals consistent with chance.
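What "at chance with 95% confidence" means can be checked with a standard normal-approximation confidence interval for a proportion. The judgement count below is hypothetical (the snippet does not state the exact n); the 54% figure is from the entry above.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """95% normal-approximation confidence interval for a proportion."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - half, p_hat + half)

# Hypothetical: 150 judgements of human-written poems, 54% labeled "human".
lo, hi = proportion_ci(0.54, 150)
print(f"[{lo:.3f}, {hi:.3f}]")  # the interval includes 0.5, i.e. chance
```

At this sample size the interval straddles 0.5, so an accuracy of 54% is statistically indistinguishable from guessing.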
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026
- Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.
- To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm.
Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif, Segev Shlomov · Feb 18, 2026
- Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, and gating.
- We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces.
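The core idea, distilling repeated LLM calls for closed-set decisions into a cheap classifier trained on execution traces, can be sketched in stdlib Python. This is a toy bag-of-words centroid classifier, not the paper's TabAgent model, and the trace data is hypothetical.

```python
from collections import Counter

def train_router(traces):
    """Build per-label word-count centroids from (query, chosen_route) traces."""
    centroids = {}
    for query, label in traces:
        centroids.setdefault(label, Counter()).update(query.lower().split())
    return centroids

def route(centroids, query):
    """Pick the route whose centroid shares the most words with the query."""
    words = Counter(query.lower().split())
    def overlap(label):
        c = centroids[label]
        return sum(min(words[w], c[w]) for w in words)
    return max(centroids, key=overlap)

# Hypothetical execution traces: (user query, route the LLM originally chose).
traces = [
    ("refund my last order", "billing"),
    ("charge appeared twice on my card", "billing"),
    ("app crashes on startup", "technical"),
    ("error when uploading a file", "technical"),
]
router = train_router(traces)
print(route(router, "double charge on my card"))  # billing
```

Once trained, the classifier answers the closed-set decision without an LLM call, which is the cost-saving the paper pursues with a far stronger model.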
Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026
- LLM-based agents execute real-world workflows via tools and memory.
- These affordances also enable ill-intentioned adversaries to use such agents to carry out complex misuse scenarios.
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin · Feb 18, 2026
- Existing evaluations of agents with memory typically assess memorization and action in isolation.
- One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions.
Kaiqu Liang, Julia Kruk, Shengyi Qian, Xianjun Yang, Shengjie Bi, Yuanshun Yao · Feb 18, 2026
- Modern AI agents are powerful but often fail to align with the idiosyncratic, evolving preferences of individual users.
- Prior approaches typically rely on static datasets, either training implicit preference models on interaction history or encoding user profiles in external memory.
Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli, Majid Sarrafzadeh · Feb 17, 2026
- While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.
- We survey 335 individuals with lived mental health experience to collect preference rankings across therapeutic dimensions, then develop a multi-objective alignment framework using direct preference optimization.
Shahriar Golchin, Marc Wetter · Feb 17, 2026
- We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice.
- We find that these datasets over-rely on "triggering cues": words or phrases with overtly negative or sensitive connotations intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks.
GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du · Feb 17, 2026
- We present GLM-5, a next-generation foundation model designed to transition the paradigm from vibe coding to agentic engineering.
- Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity.
Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé · Feb 17, 2026
- In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences.
- We introduce ChartEditBench, a benchmark for incremental, visually grounded chart editing via code, comprising 5,000 difficulty-controlled modification chains and a rigorously human-verified subset.
Roksana Goworek, Haim Dubossarsky · Feb 17, 2026
Tim Fischer, Chris Biemann · Feb 17, 2026
- This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.
- Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities.
Mohammad Aflah Khan, Mahsa Amani, Soumi Das, Bishwamittra Ghosh, Qinyuan Wu, Krishna P. Gummadi · Feb 17, 2026
- Agents based on Large Language Models (LLMs) are increasingly being deployed as interfaces to information on online platforms.
- These agents filter, prioritize, and synthesize information retrieved from the platforms' back-end databases or via web search.
Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li, Shengyu Zhang · Feb 17, 2026
- Web agents based on large language models have demonstrated promising capability in automating web tasks.
- However, current web agents struggle to reason out sensible actions because of their limited ability to predict environment changes, and may lack comprehensive awareness of execution risks, prematurely performing risky actions.
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu · Feb 17, 2026
- Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information loss.
- By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver's visual pathway, effectively treating the vision encoder as a universal port.
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang, Li Qing · Feb 17, 2026
- Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
- We first define the components and evaluation metrics for TOFs, then formalize a cost-efficient flowchart construction algorithm to abstract procedural knowledge from service dialogues.
Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026
- To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives.
- Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework's robustness.
Victor De Lima, Jiqun Liu, Grace Hui Yang · Feb 17, 2026
- Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence.
- Human evaluation further confirms that FrameRef's generated framings measurably affect human judgment.
Protocol Hubs
Benchmark Hubs
- Retrieval Benchmark Papers (115)
- Retrieval Benchmark Papers (Last 365 Days) (111)
- Retrieval Or GSM8K Benchmark Papers (128)
- Retrieval Or MMLU Benchmark Papers (128)
- Retrieval Or MATH Benchmark Papers (135)
- Retrieval Or DROP Benchmark Papers (129)
- Retrieval Or MMLU Or GSM8K Benchmark Papers (139)
- Retrieval Or MATH Or GSM8K Benchmark Papers (148)
- Retrieval Or MATH Or MMLU Benchmark Papers (146)
- Retrieval Or DROP Or GSM8K Benchmark Papers (141)