Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 15 · Search mode: keyword

Filter by tag

All Automatic Metrics (631) General (245) Long Horizon (128) Pairwise Preference (128) Coding (97) Simulation Env (86) Multi Agent (63) Medicine (42) Expert Verification (41) LLM As Judge (40) Web Browsing (35) Rubric Rating (33) Demonstrations (31) Human Eval (30) Red Team (30) Math (29)

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao, Bo Dong · Feb 11, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 55% High protocol signal Freshness: Hot Status: Ready

Pairwise Preference Tool Use Math Coding

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and…

Open paper

Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference

Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 55% High protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Tool Use General

Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%.

Open paper

A Benchmark for Deep Information Synthesis

Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger · Feb 24, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Tool Use Coding

To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights.
When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, underscoring the difficulty of the benchmark.

Open paper

REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

Zheng Chu, Xiao Wang, Jack Hong, Huiming Fan, Yuqi Huang, Yue Yang · Feb 15, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 55% Moderate protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Tool Use Coding

To address these challenges, we propose REDSearcher, a unified framework that co-designs complex task synthesis, mid-training, and post-training for scalable search-agent optimization.
Across both text-only and multimodal search-agent benchmarks, our approach achieves state-of-the-art performance.

Open paper

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li · Feb 12, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 55% High protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Tool Use Coding

To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA data spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional…

Open paper

Tucano 2 Cool: Better Open Source LLMs for Portuguese

Nicholas Kluge Corrêa, Aniket Sen, Shiza Fatimah, Sophia Falk, Lennard Landgraf, Julia Kastner · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 52% Moderate protocol signal Freshness: Hot Status: Fallback

Pairwise Preference Tool Use Coding

Following our previous works, we now extend our dataset, GigaVerbo-v2, to a new degree of quality and scale, while also introducing a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling missing gaps in GigaVerbo-v2, and two…
Through extensive ablation studies, we design both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese-language modeling…

Open paper

PyVision-RL: Forging Open Agentic Vision Models via RL

Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang · Feb 24, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 52% Moderate protocol signal Freshness: Hot Status: Fallback

Automatic Metrics Tool Use General

Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior.
Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.

Open paper

What Matters For Safety Alignment?

Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan · Jan 7, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 48% High protocol signal Freshness: Warm Status: Ready

Red Team Automatic Metrics Tool Use General

This paper presents a comprehensive empirical study on the safety alignment capabilities.
We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems.

Open paper

EnsembleLink: Accurate Record Linkage Without Training Data

Noah Dasanaike · Jan 29, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 48% Moderate protocol signal Freshness: Warm Status: Fallback

Automatic Metrics Tool Use Multilingual

On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling.

Open paper

Jiliang Ni, Jiachen Pu, Zhongyi Yang, Jingfeng Luo, Conggang Hu · Feb 3, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 45% Moderate protocol signal Freshness: Warm Status: Fallback

Automatic Metrics Tool Use General

The proliferation of Large Language Models (LLMs) in function calling is pivotal for creating advanced AI agents, yet their large scale hinders widespread adoption, necessitating transferring their capabilities into smaller ones.
Extensive experiments on challenging and renowned benchmarks demonstrate the effectiveness of our method.

Open paper

Measuring AI Ability to Complete Long Software Tasks

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin · Mar 18, 2025

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 43% High protocol signal Freshness: Cold Status: Ready

Expert Verification Automatic Metrics Tool Use General

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon.

Open paper

LightMem: Lightweight and Efficient Memory-Augmented Generation

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu · Oct 21, 2025

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 43% High protocol signal Freshness: Cold Status: Fallback

Automatic Metrics Tool Use Coding

Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages.

Open paper

LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

Guozhao Mo, Wenliang Zhong, Jiawei Chen, Qianhao Yuan, Xuanang Chen, Yaojie Lu · Aug 3, 2025

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 43% High protocol signal Freshness: Cold Status: Fallback

LLM As Judge Tool Use Medicine Coding

Unfortunately, there is still a large gap between real-world MCP usage and current evaluation: they typically assume single-server settings and directly inject tools into the model's context, bypassing the challenges of large-scale…
We benchmark 12 state-of-the-art LLMs and observe a substantial performance gap: while Claude-Sonnet-4 reaches 78.95% task success, most models achieve only 30-50%.

Open paper

RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning

Xiao Liu, Da Yin, Zirui Wu, Yansong Feng · May 27, 2025

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 43% Moderate protocol signal Freshness: Cold Status: Fallback

Automatic Metrics Tool Use Multilingual

Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 12.3% on average accuracy, while being cost-efficient and broadly generalizable…

Open paper

Should You Use Your Large Language Model to Explore or Exploit?

Keegan Harris, Aleksandrs Slivkins · Jan 31, 2025

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 40% Moderate protocol signal Freshness: Cold Status: Fallback

Automatic Metrics Tool Use General

We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff.

Open paper

Protocol Hubs

Expert Verification Papers (39) CS.CL + Expert Verification Papers (30) Rubric Rating Papers (29) CS.CL + Rubric Rating Papers (27) CS.AI + Expert Verification Papers (25) Pairwise Preference Papers (114) CS.CL + Pairwise Preference Papers (99) Coding Papers (91) CS.CL Human Feedback And Eval Papers (1,674) CS.CL + Coding Papers (73) Expert Verification Papers (Last 120 Days) (30) Medicine + Expert Verification Papers (20) Expert Verification Papers (Last 90 Days) (29) Medicine Papers (40) CS.AI + Medicine Papers (25) Automatic Metrics + Expert Verification Papers (25)

Benchmark Hubs

Metric Hubs

Daily Archives

