Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 15

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly to tag and hub pages to explore deeper content clusters.

Start Here by Objective

Pick your immediate research objective and jump directly to high-signal pages rather than generic search.

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao, Bo Dong · Feb 11, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 55% · High protocol signal · Freshness: Hot · Status: Ready
Pairwise Preference · Tool Use · Math · Coding
  • We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
  • Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and…
Open paper
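
For readers new to sparse MoE, a minimal sketch of top-k expert routing may help explain the "active parameters" framing: only k of the stored experts run per token, so the active parameter count sits far below the total. This is a generic illustration under that assumption, not Step 3.5 Flash's actual architecture; all names and sizes are hypothetical.

```python
# Generic top-k MoE routing sketch: each token runs only k experts,
# so "active" parameters per token are a small slice of the total.
import torch
import torch.nn as nn

def moe_layer(x: torch.Tensor, experts: list[nn.Module],
              router: nn.Linear, k: int = 2) -> torch.Tensor:
    scores = router(x)                              # (tokens, n_experts)
    top = scores.topk(k, dim=-1)                    # each token picks k experts
    weights = torch.softmax(top.values, dim=-1)     # mixing weights over the k
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = top.indices[:, slot] == e        # tokens routed to expert e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out

# Usage: 8 experts stored, only 2 run per token.
x = torch.randn(16, 64)
experts = [nn.Linear(64, 64) for _ in range(8)]
router = nn.Linear(64, 8)
print(moe_layer(x, experts, router).shape)  # torch.Size([16, 64])
```
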
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference

Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 55% · High protocol signal · Freshness: Hot · Status: Fallback
Automatic Metrics · Tool Use · General
  • Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%.
Open paper
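
The abstract excerpt reports only the cost-accuracy result, so here is a minimal sketch of the generic pattern the title suggests: a confidence-thresholded cascade that tries cheap models first and escalates. The function shape, the confidence signal, and the threshold are assumptions for illustration, not the paper's method.

```python
# Minimal confidence-thresholded model cascade (generic pattern, not the
# paper's exact method). `models` and the confidence signal are stand-ins.
from typing import Callable, Sequence, Tuple

def cascade_answer(
    question: str,
    models: Sequence[Callable[[str], Tuple[str, float]]],  # ordered small -> large
    threshold: float = 0.85,
) -> str:
    """Try cheap models first; escalate only when confidence is low."""
    answer = ""
    for model in models:
        answer, confidence = model(question)  # e.g., max answer-choice probability
        if confidence >= threshold:
            return answer  # confident enough: skip the larger, costlier models
    return answer  # otherwise fall back to the largest model's answer
```
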
A Benchmark for Deep Information Synthesis

Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger · Feb 24, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 55% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Automatic Metrics · Tool Use · Coding
  • To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights.
  • When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and a maximum LLM-judge score of 17.5, underscoring the difficulty of the benchmark.
Open paper
REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

Zheng Chu, Xiao Wang, Jack Hong, Huiming Fan, Yuqi Huang, Yue Yang · Feb 15, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 55% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Automatic Metrics · Tool Use · Coding
  • To address these challenges, we propose REDSearcher, a unified framework that co-designs complex task synthesis, mid-training, and post-training for scalable search-agent optimization.
  • Across both text-only and multimodal search-agent benchmarks, our approach achieves state-of-the-art performance.
Open paper
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li · Feb 12, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 55% · High protocol signal · Freshness: Hot · Status: Fallback
Automatic Metrics · Tool Use · Coding
  • To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
  • To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA items spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional…
Open paper
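
A hedged sketch of the general idea the bullet describes: a teacher answering from a zoomed-in crop supervises a student that sees only the full image, so "zooming" is internalized at training time. The models, tensors, and loss choice below are hypothetical stand-ins, not the paper's implementation.

```python
# Region-to-image distillation, generic form: match the full-image
# student's answer distribution to a zoomed-crop teacher's.
import torch
import torch.nn.functional as F

def distill_step(student, teacher, full_image, zoom_crop, question):
    """One training step; both models map (image, question) -> answer logits."""
    with torch.no_grad():
        teacher_logits = teacher(zoom_crop, question)   # privileged zoomed view
    student_logits = student(full_image, question)      # single full-image pass
    # KL divergence pulls the student toward the teacher's distribution.
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
```
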
Tucano 2 Cool: Better Open Source LLMs for Portuguese

Nicholas Kluge Corrêa, Aniket Sen, Shiza Fatimah, Sophia Falk, Lennard Landgraf, Julia Kastner · Mar 3, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 52% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Pairwise Preference · Tool Use · Coding
  • Following our previous work, we extend our dataset, GigaVerbo-v2, to a new degree of quality and scale, while also introducing a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling gaps in GigaVerbo-v2, and two…
  • Through extensive ablation studies, we design both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese-language modeling…
Open paper
PyVision-RL: Forging Open Agentic Vision Models via RL

Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang · Feb 24, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 52% · Moderate protocol signal · Freshness: Hot · Status: Fallback
Automatic Metrics · Tool Use · General
  • Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior.
  • Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
Open paper
What Matters For Safety Alignment?

Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan · Jan 7, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 48% · High protocol signal · Freshness: Warm · Status: Ready
Red Team · Automatic Metrics · Tool Use · General
  • This paper presents a comprehensive empirical study of safety alignment capabilities in LLMs and LRMs.
  • We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems.
Open paper
Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 48% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Automatic Metrics · Tool Use · Multilingual
  • On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling.
Open paper
STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models

Jiliang Ni, Jiachen Pu, Zhongyi Yang, Jingfeng Luo, Conggang Hu · Feb 3, 2026

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 45% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Automatic Metrics · Tool Use · General
  • The proliferation of Large Language Models (LLMs) in function calling is pivotal for creating advanced AI agents, yet their large scale hinders widespread adoption, necessitating the transfer of their capabilities into smaller models.
  • Extensive experiments on challenging and renowned benchmarks demonstrate the effectiveness of our method.
Open paper
Measuring AI Ability to Complete Long Tasks

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin · Mar 18, 2025

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 43% · High protocol signal · Freshness: Cold · Status: Ready
Expert Verification · Automatic Metrics · Tool Use · General
  • Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
  • To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon.
Open paper
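
To make the metric concrete, here is one way to operationalize a 50%-task-completion time horizon, assuming a logistic fit of success against log task length and solving for the length where predicted success crosses 0.5. The data below is purely illustrative, not from the paper.

```python
# Illustrative 50% time-horizon computation: fit success vs. log task
# length, then solve for the length where P(success) = 0.5. Fake data.
import numpy as np
from sklearn.linear_model import LogisticRegression

task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
succeeded = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])  # per-task outcomes

X = np.log(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, succeeded)

# P(success) = 0.5 where w * log(t) + b = 0  =>  t = exp(-b / w)
w, b = clf.coef_[0][0], clf.intercept_[0]
horizon_minutes = np.exp(-b / w)
print(f"50% time horizon: {horizon_minutes:.1f} minutes")
```
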
LightMem: Lightweight and Efficient Memory-Augmented Generation

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu · Oct 21, 2025

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 43% · High protocol signal · Freshness: Cold · Status: Fallback
Automatic Metrics · Tool Use · Coding
  • Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages.
Open paper
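
For context, the Atkinson-Shiffrin model distinguishes sensory, short-term, and long-term stores. A minimal sketch of a memory pipeline in that spirit follows; the class, method names, and policies are hypothetical illustrations of the three-stage idea, not LightMem's actual API.

```python
# Three-stage memory sketch in the Atkinson-Shiffrin spirit:
# sensory filtering -> bounded short-term store -> long-term consolidation.
from collections import deque
from typing import Optional

class ThreeStageMemory:
    def __init__(self, short_term_capacity: int = 32):
        self.short_term: deque = deque(maxlen=short_term_capacity)
        self.long_term: list = []

    def perceive(self, raw_input: str) -> Optional[str]:
        """Sensory stage: cheap filtering/compression of raw input."""
        text = raw_input.strip()
        return text if text else None  # drop empty or pure-noise inputs

    def attend(self, raw_input: str) -> None:
        """Short-term stage: keep a bounded working set of recent items."""
        item = self.perceive(raw_input)
        if item is not None:
            self.short_term.append(item)  # oldest items fall off automatically

    def consolidate(self) -> None:
        """Long-term stage: offline transfer from working set to storage."""
        self.long_term.extend(self.short_term)
        self.short_term.clear()
```
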
LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

Guozhao Mo, Wenliang Zhong, Jiawei Chen, Qianhao Yuan, Xuanang Chen, Yaojie Lu · Aug 3, 2025

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 43% · High protocol signal · Freshness: Cold · Status: Fallback
LLM-as-Judge · Tool Use · Medicine · Coding
  • Unfortunately, there is still a large gap between real-world MCP usage and current evaluation: they typically assume single-server settings and directly inject tools into the model's context, bypassing the challenges of large-scale…
  • We benchmark 12 state-of-the-art LLMs and observe a substantial performance gap: while Claude-Sonnet-4 reaches 78.95% task success, most models achieve only 30-50%.
Open paper
RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning

Xiao Liu, Da Yin, Zirui Wu, Yansong Feng · May 27, 2025

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 43% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Automatic Metrics · Tool Use · Multilingual
  • Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 12.3% on average accuracy, while being cost-efficient and broadly generalizable…
Open paper
Should You Use Your Large Language Model to Explore or Exploit?

Keegan Harris, Aleksandrs Slivkins · Jan 31, 2025

Citations: 0

Match reason: Matches selected tags (Tool Use).

Score: 40% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Automatic Metrics · Tool Use · General
  • We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff.
Open paper
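
The setting behind this question is the classic multi-armed bandit. As a reference point for what the LLM is being asked to help with, here is a standard epsilon-greedy baseline; it is purely illustrative of the exploration-exploitation tradeoff, not the paper's evaluation protocol, and the reward distributions are made up.

```python
# Epsilon-greedy on a Gaussian multi-armed bandit: with probability
# epsilon, explore a random arm; otherwise exploit the best-known arm.
import random

def epsilon_greedy(true_means, steps=1000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(true_means)    # pulls per arm
    values = [0.0] * len(true_means)  # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:                       # explore
            arm = rng.randrange(len(true_means))
        else:                                            # exploit
            arm = max(range(len(true_means)), key=values.__getitem__)
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
    return total / steps

print(epsilon_greedy([0.2, 0.5, 0.8]))  # average reward approaches the best arm
```
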
