Research Utility Snapshot
Evaluation Modes
- LLM-as-Judge (7)
- Automatic Metrics (2)
- Human Eval (2)
Human Feedback Types
- Expert Verification (2)
- Rubric Rating (2)
- Pairwise Preference (1)
Overton Pluralistic Reinforcement Learning for Large Language Models Yu Fu, Seongho Son, Ilija Bogunovic · Feb 24, 2026 · Citations: 0
LLM-as-Judge · Automatic Metrics · General
- Existing alignment paradigms remain limited in capturing the pluralistic nature of human values.
- First, similarity-estimator training fine-tunes a Sentence Transformer on Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses.
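The coverage evaluation described above can be illustrated with a generic sketch. This is not the paper's trained similarity estimator: `embed` below is a toy bag-of-characters stand-in for a fine-tuned sentence-embedding model, and the threshold and matching rule are assumptions. Each reference perspective is matched to its most similar response sentence; coverage is the fraction of perspectives whose best match clears the threshold.

```python
# Hedged sketch of embedding-based coverage scoring, NOT the paper's method.
# `embed` is a toy stand-in for a fine-tuned Sentence Transformer.
import math

def embed(text):
    # Toy bag-of-letters embedding; a real system would call a
    # sentence-embedding model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(u, v):
    # Standard cosine similarity; 0.0 when either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def coverage(response_sentences, perspectives, threshold=0.9):
    # Fraction of reference perspectives whose best-matching response
    # sentence exceeds the similarity threshold.
    resp_vecs = [embed(s) for s in response_sentences]
    hits = sum(
        1 for p in perspectives
        if max(cosine(embed(p), r) for r in resp_vecs) >= threshold
    )
    return hits / len(perspectives)
```

For example, a response that restates one of two reference perspectives verbatim and misses the other would score a coverage of 0.5 under this toy scheme.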
Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning Justin Lovelace, Christian Belardi, Sofian Zalouk, Adhitya Polavaram, Srivatsa Kundurthy, Kilian Q. Weinberger · Feb 24, 2026 · Citations: 0
LLM-as-Judge · Automatic Metrics · General
- Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves $>70\%$ win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning.
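A win rate like the one reported above is typically tallied by asking a judge model to pick between two responses per prompt. The sketch below is a generic harness, not STAR-LDM's actual evaluation code: `judge` is a hypothetical callable returning `"A"` or `"B"`, and each prompt is judged twice with the response order swapped, a common control for position bias in LLM-as-judge setups.

```python
# Hedged sketch of an LLM-as-judge win-rate tally, NOT the paper's harness.
# `judge(prompt, a, b)` is a hypothetical callable returning "A" or "B".
def win_rate(judge, pairs):
    """pairs: list of (prompt, model_response, baseline_response)."""
    wins = total = 0
    for prompt, model, baseline in pairs:
        # Judge each pair twice with slots swapped to control position bias.
        for model_slot, (a, b) in (("A", (model, baseline)),
                                   ("B", (baseline, model))):
            total += 1
            if judge(prompt, a, b) == model_slot:
                wins += 1
    return wins / total

# Toy judge that always prefers the longer response.
toy_judge = lambda prompt, a, b: "A" if len(a) >= len(b) else "B"
print(win_rate(toy_judge, [("q", "long answer", "short")]))  # 1.0
```

A real setup would replace `toy_judge` with a call to a judge LLM prompted with a comparison rubric; the swap trick means a judge that always answers "A" scores exactly 0.5 rather than 1.0.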
World-Model-Augmented Web Agents with Action Correction Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li, Shengyu Zhang · Feb 17, 2026 · Citations: 0
LLM-as-Judge · Simulation Env · General
- Web agents based on large language models have demonstrated promising capability in automating web tasks.
- However, current web agents struggle to reason out sensible actions due to limitations in predicting environment changes, and may lack comprehensive awareness of execution risks, prematurely performing risky actions that cause …
HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee · Jan 9, 2026 · Citations: 0
Pairwise Preference · Rubric Rating · Human Eval · LLM-as-Judge · General
- Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
- We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations.
PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut · Oct 21, 2025 · Citations: 0
Rubric Rating · Human Eval · LLM-as-Judge · General
- While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge.
- In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g., …).
EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025 · Citations: 0
Expert Verification · LLM-as-Judge · Simulation Env · General
- We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization, and …
- We introduce two types of agents: a scientist agent for planning, coordination, reflection, and generation of final results, and a task-expert agent to focus exclusively on one specific duty serving as a tool to the scientist agent.
DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto · Jun 20, 2025 · Citations: 0
Expert Verification · LLM-as-Judge · Medicine
- This study introduces DistillNote, an evaluation framework for LLM summaries that targets their functional utility by applying the generated summary downstream in a complex clinical prediction task, explicitly quantifying how much prediction …
- We contrasted DistillNote's results with evaluations from LLM-as-judge and clinicians, assessing consistency across different evaluation methods.