Researcher Tools
Human Feedback and Eval Paper Explorer
A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.
Kuldip Singh Atwal, Dieter Pfoser, Daniel Rothbart · Dec 8, 2025
Nischal Karki, Bipesh Subedi, Prakash Poudyal, Rupak Raj Ghimire, Bal Krishna Bal · Feb 27, 2026
Kexin Ma, Bojun Li, Yuhua Tang, Liting Sun, Ruochun Jin · Jan 14, 2026
- Episodic memory, the ability to recall coherent events grounded in who, when, and where, is a central component of human memory.
- Experiments demonstrate that CAST improves F1 by an average of 8.11% and the LLM-as-a-Judge score (J) by 10.21% over baselines across various datasets, especially on open-ended and time-sensitive conversational questions.
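The F1 figure above is presumably computed at the answer level. A minimal sketch of token-overlap F1, as used in common QA evaluations (an assumption, not CAST's exact scoring code):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat", "the cat sat down"))  # 6/7 ≈ 0.857
```

The complementary J score in the bullet is an LLM-as-a-Judge rating, which has no closed-form formula; it is obtained by prompting a judge model to grade each answer.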
Zhivar Sourati, Zheng Wang, Marianne Menglin Liu, Yazhe Hu, Mengqing Guo, Sujeeth Bharadwaj · Oct 8, 2025
Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham · Feb 26, 2026
C. Seas, G. Fitzpatrick, J. A. Hamilton, M. C. Carlisle · Feb 26, 2026
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin · Feb 18, 2026
- Existing evaluations of agents with memory typically assess memorization and action in isolation.
- To capture this setting, we introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops.
Jiasen Zheng, Zijun Zhou, Huajun Zhang, Junjiang Lin, Jingyun Jia, Qi Wang · Feb 27, 2026
Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu, Yuze Sun · Jun 25, 2025
- Here we present DeepRare, a multi-agent system for rare disease differential diagnosis decision support powered by large language models, integrating over 40 specialized tools and up-to-date knowledge sources.
- In human-phenotype-ontology-based tasks, it achieves an average Recall@1 of 57.18%, outperforming the next-best method by 23.79%; in multi-modal tests, it reaches 69.1% compared with Exomiser's 55.9% on 168 cases.
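Recall@1 as reported above is the fraction of cases where the gold diagnosis tops the ranked differential list. A hedged sketch under the assumption of one gold diagnosis per case (not DeepRare's actual evaluation code; the `HPO:*` labels are placeholders):

```python
def recall_at_k(ranked_predictions: list[list[str]],
                gold: list[str], k: int) -> float:
    """Fraction of cases whose gold diagnosis appears in the top-k list."""
    hits = sum(g in preds[:k] for preds, g in zip(ranked_predictions, gold))
    return hits / len(gold)

# Three toy cases: only the first ranks the gold diagnosis first.
preds = [["HPO:A", "HPO:B"], ["HPO:C", "HPO:A"], ["HPO:B", "HPO:C"]]
gold = ["HPO:A", "HPO:A", "HPO:C"]
print(recall_at_k(preds, gold, 1))  # 1/3 of cases hit at rank 1
```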
Zheng Chu, Xiao Wang, Jack Hong, Huiming Fan, Yuqi Huang, Yue Yang · Feb 15, 2026
- To address these challenges, we propose REDSearcher, a unified framework that co-designs complex task synthesis, mid-training, and post-training for scalable search-agent optimization.
- Across both text-only and multimodal search-agent benchmarks, our approach achieves state-of-the-art performance.
Zitong Xu, Yuqing Wu, Yue Zhao · Feb 27, 2026
Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026
- The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.
- Across all concepts, CUICurate produced substantially larger and more complete concept sets than the manual benchmarks whilst matching human precision.
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026
- When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
- By prioritizing deterministic rules, clear decision tracking, and retaining unresolved cases for human review, the framework provides a practical foundation for downstream manufacturing automation in real-world industrial environments.
Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram, Rizwan Hamid · Feb 23, 2026
- Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype…
- Using clinician-curated HPO terms as the gold standard, RARE-PHENIX consistently outperformed a state-of-the-art deep learning baseline (PhenoBERT) across ontology-based similarity and precision-recall-F1 metrics in end-to-end evaluation…
Protocol Hubs
Benchmark Hubs
Metric Hubs
- Accuracy & Pass Rate Metric Papers (88)
- Accuracy Metric Papers (82)
- Accuracy & Pass Rate Metric Papers In CS.CL (63)
- Accuracy & Pass Rate Metric Papers + Automatic Metrics (74)
- Accuracy In CS.CL Papers (58)
- Accuracy & Pass Rate Metric Papers In CS.AI (58)
- Accuracy + Automatic Metrics Metric Papers (70)
- Accuracy + Automatic Metrics Metric Papers (Last 120 Days) (53)
- Accuracy + Automatic Metrics Metric Papers (Last 90 Days) (51)
- Accuracy + Automatic Metrics Metric Papers (Last 30 Days) (47)