
OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 6 · Search mode: keyword
Overton Pluralistic Reinforcement Learning for Large Language Models

Yu Fu, Seongho Son, Ilija Bogunovic · Feb 24, 2026

Citations: 0
LLM-as-Judge · Automatic Metrics · General
  • Existing alignment paradigms remain limited in capturing the pluralistic nature of human values.
  • First, similarity-estimator training fine-tunes a Sentence Transformer on Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses.
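The coverage evaluation described above can be sketched in miniature. The paper fine-tunes a Sentence Transformer as the similarity estimator; in this self-contained toy, a token-overlap (Jaccard) similarity stands in for the learned embedding similarity, and the threshold is an illustrative assumption.

```python
# Toy sketch of coverage scoring: for each reference perspective, check
# whether some sentence in the generated response is similar enough.
# Jaccard token overlap stands in for the paper's fine-tuned
# Sentence Transformer similarity; the 0.3 threshold is illustrative.

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def coverage(response_sentences, reference_perspectives, threshold=0.3):
    """Fraction of reference perspectives matched by at least one sentence."""
    covered = sum(
        any(jaccard(p, s) >= threshold for s in response_sentences)
        for p in reference_perspectives
    )
    return covered / len(reference_perspectives)
```

Swapping `jaccard` for cosine similarity over sentence embeddings recovers the shape of the approach the abstract describes.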
Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning

Justin Lovelace, Christian Belardi, Sofian Zalouk, Adhitya Polavaram, Srivatsa Kundurthy, Kilian Q. Weinberger · Feb 24, 2026

Citations: 0
LLM-as-Judge · Automatic Metrics · General
  • Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves >70% win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning.
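A win rate like the one reported above is computed from pairwise LLM-as-judge verdicts. A minimal sketch, assuming the common convention of counting ties as half a win (the paper's exact tie handling is not stated here):

```python
# Minimal sketch of an LLM-as-judge win rate over pairwise comparisons.
# Each verdict is "win", "loss", or "tie" from the evaluated model's
# perspective; ties count as half a win (an assumed convention).

def win_rate(verdicts: list[str]) -> float:
    wins = verdicts.count("win") + 0.5 * verdicts.count("tie")
    return wins / len(verdicts)
```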
World-Model-Augmented Web Agents with Action Correction

Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li, Shengyu Zhang · Feb 17, 2026

Citations: 0
LLM-as-Judge · Simulation Env · Multi-Agent · General
  • Web agents based on large language models have demonstrated promising capability in automating web tasks.
  • However, current web agents struggle to reason out sensible actions due to limitations in predicting environment changes, and may lack comprehensive awareness of execution risks, prematurely performing risky actions that cause …
HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee · Jan 9, 2026

Citations: 0
Pairwise Preference · Rubric Rating · Human Eval · LLM-as-Judge · General
  • Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
  • We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations.
EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis

Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025

Citations: 0
Expert Verification · LLM-as-Judge · Simulation Env · Multi-Agent · General
  • We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization, and …
  • We introduce two types of agents: a scientist agent for planning, coordination, reflection, and generation of final results, and a task-expert agent to focus exclusively on one specific duty serving as a tool to the scientist agent.
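The scientist/task-expert split described above can be sketched as a coordinator dispatching single-duty workers. All names below are illustrative assumptions, not the EpidemIQs API:

```python
# Hypothetical sketch of the scientist/task-expert architecture: a
# coordinating "scientist" executes a plan by calling one single-duty
# expert per step, then collects the results. Names are illustrative.

from typing import Callable

class ScientistAgent:
    def __init__(self, experts: dict[str, Callable[[str], str]]):
        self.experts = experts  # each expert handles exactly one duty

    def run(self, plan: list[tuple[str, str]]) -> list[str]:
        # plan: (duty, task) pairs; dispatch each task to its expert
        return [self.experts[duty](task) for duty, task in plan]

# Usage: stub experts standing in for literature review and simulation
experts = {
    "literature": lambda t: f"reviewed: {t}",
    "simulation": lambda t: f"simulated: {t}",
}
results = ScientistAgent(experts).run(
    [("literature", "SIR models"), ("simulation", "outbreak on a network")]
)
```

In the real system each expert would wrap an LLM call plus tools; the stubs here only show the coordination pattern.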
DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries

Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto · Jun 20, 2025

Citations: 0
Expert Verification · LLM-as-Judge · Medicine
  • This study introduces DistillNote, an evaluation framework for LLM summaries that targets their functional utility by applying the generated summary downstream in a complex clinical prediction task, explicitly quantifying how much prediction …
  • We contrasted DistillNote's results with evaluations from LLM-as-judge and clinicians, assessing consistency across different evaluation methods.
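One simple way to assess the cross-method consistency mentioned above is pairwise rank agreement: the fraction of item pairs that two evaluators (e.g., LLM-as-judge vs. clinicians) order the same way. This is a stand-in for standard rank-correlation statistics, not the paper's stated metric:

```python
# Sketch of consistency between two evaluators scoring the same summaries:
# the fraction of item pairs ordered the same way by both. Tied pairs
# count as disagreement in this simple version.

from itertools import combinations

def pairwise_agreement(scores_a: list[float], scores_b: list[float]) -> float:
    pairs = list(combinations(range(len(scores_a)), 2))
    agree = sum(
        (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j]) > 0
        for i, j in pairs
    )
    return agree / len(pairs)
```

A value near 1.0 means the two evaluation methods rank summaries consistently; near 0.5 suggests little shared signal.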
