Research Utility Snapshot
Evaluation Modes
- Automatic Metrics (11)
- Simulation Env (3)
- Human Eval (1)
Human Feedback Types
- Expert Verification (1)
- Pairwise Preference (1)
- Red Team (1)
Required Expertise
- General (10)
- Coding (4)
- Law (1)
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026 · Citations: 0
Automatic Metrics General
- Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%.
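The abstract does not give the selection rule, but confidence-driven multi-scale selection is commonly realized as a cascade: query a small model first and escalate to a larger one only when confidence is low. A minimal sketch under that assumption (model names, the confidence scores, and the threshold are all illustrative, not from the paper):

```python
def cascade_answer(question, models, threshold=0.8):
    """models: list of (name, answer_fn) ordered small -> large.
    Each answer_fn returns (answer, confidence in [0, 1])."""
    for name, answer_fn in models[:-1]:
        answer, confidence = answer_fn(question)
        if confidence >= threshold:
            return answer, name  # the cheap model was confident enough
    # fall back to the largest model unconditionally
    name, answer_fn = models[-1]
    answer, _ = answer_fn(question)
    return answer, name
```

Cost savings come from the fraction of queries the small model answers on its own; the threshold trades accuracy against cost.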
A Benchmark for Deep Information Synthesis Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger · Feb 24, 2026 · Citations: 0
Human Eval Automatic Metrics Coding
- Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis.
- However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval.
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang · Feb 24, 2026 · Citations: 0
Simulation Env Law Coding
- Agentic systems increasingly rely on reusable procedural capabilities, a.k.a. agentic skills, to execute long-horizon workflows reliably.
- This paper maps the skill layer across the full lifecycle (discovery, practice, distillation, storage, composition, evaluation, and update) and introduces two complementary taxonomies.
PyVision-RL: Forging Open Agentic Vision Models via RL Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang · Feb 24, 2026 · Citations: 0
Automatic Metrics General
- Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior.
- Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape, Yuhao Zhang · Feb 16, 2026 · Citations: 0
Simulation Env General
- Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks.
- While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general "search" APIs) are often opaque, lacking clear best practices or failure modes.
MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents Zhenhong Zhou, Yuanhe Zhang, Hongwei Cai, Moayad Aloqaily, Ouns Bouachir, Linsey Pang · Feb 15, 2026 · Citations: 0
Automatic Metrics General
- The Model Context Protocol (MCP) standardizes tool use for LLM-based agents and enables third-party servers.
- This openness introduces a security misalignment: agents implicitly trust tools exposed by potentially untrusted MCP servers.
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li · Feb 12, 2026 · Citations: 0
Automatic Metrics Coding
- To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
- To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA examples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zoo…
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao, Bo Dong · Feb 11, 2026 · Citations: 0
Pairwise Preference Simulation Env Math Coding
- We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
- We focus on what matters most when building agents: sharp reasoning and fast, reliable execution.
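The efficiency claim rests on sparse Mixture-of-Experts routing: a gate scores every expert but only the top-k experts execute, so active parameters stay a small fraction of the total. A toy sketch of that routing pattern (scalar inputs and linear "experts" are illustrative; this is not the Step 3.5 Flash implementation):

```python
import math

def moe_forward(x, experts, gate_weights, k=2):
    """x: input scalar; experts: list of callables; gate_weights: one
    weight per expert. Softmax the gate scores, keep the top-k experts,
    renormalize their probabilities, and mix only those outputs."""
    scores = [w * x for w in gate_weights]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(experts)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in top)
    return sum(probs[i] / norm * experts[i](x) for i in top)
```

With k much smaller than the expert count, compute per token scales with k, not with total model size.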
RoPE-LIME: RoPE-Space Locality + Sparse-K Sampling for Efficient LLM Attribution Isaac Picov, Ritesh Goru · Feb 6, 2026 · Citations: 0
Automatic Metrics General
OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering Yifan Zhu, Xinyu Mu, Tao Feng, Zhonghong Ou, Yuning Gong, Haoran Luo · Feb 3, 2026 · Citations: 0
Automatic Metrics General
- To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning.
- Moreover, it uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries.
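The plan / call-tools / merge-evidence loop described above can be sketched as follows. The tool names, planner interface, and stopping rule are assumptions for illustration, not OmniRAG-Agent's actual design; the turn cap stands in for the stated budget:

```python
def agent_loop(query, tools, plan, merge, max_turns=5):
    """plan(query, evidence) -> a tool name, or None when done;
    tools: dict mapping tool name -> callable;
    merge folds each tool result into the accumulated evidence."""
    evidence = []
    for _ in range(max_turns):  # budgeted: bounded number of tool turns
        tool_name = plan(query, evidence)
        if tool_name is None:
            break  # planner judges the evidence sufficient
        result = tools[tool_name](query)
        evidence = merge(evidence, result)
    return evidence
```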
STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models Jiliang Ni, Jiachen Pu, Zhongyi Yang, Jingfeng Luo, Conggang Hu · Feb 3, 2026 · Citations: 0
Automatic Metrics General
- The proliferation of Large Language Models (LLMs) in function calling is pivotal for creating advanced AI agents, yet their large scale hinders widespread adoption, necessitating the transfer of their capabilities into smaller models.
- Extensive experiments on challenging and renowned benchmarks demonstrate the effectiveness of our method.
What Matters For Safety Alignment? Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan · Jan 7, 2026 · Citations: 0
Red Team Automatic Metrics General
- This paper presents a comprehensive empirical study of safety alignment capabilities in LLMs and LRMs.
- We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems.
Measuring AI Ability to Complete Long Software Tasks Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin · Mar 18, 2025 · Citations: 0
Expert Verification Automatic Metrics General
- Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
- To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon.
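One way to read the 50%-task-completion time horizon: given success rates at increasing human task lengths, find the length at which success crosses 50%. A hypothetical sketch using log-linear interpolation (the data and the interpolation choice are assumptions, not the paper's exact estimator, which fits a model to per-task outcomes):

```python
import math

def time_horizon_50(lengths_min, success_rates):
    """lengths_min: task lengths in minutes, sorted ascending;
    success_rates: corresponding success rates, assumed decreasing.
    Returns the length where the rate crosses 0.5, or None."""
    pairs = list(zip(lengths_min, success_rates))
    for (l0, s0), (l1, s1) in zip(pairs, pairs[1:]):
        if s0 >= 0.5 >= s1:
            # interpolate in log(task length): horizons scale geometrically
            t = (s0 - 0.5) / (s0 - s1)
            return math.exp(math.log(l0) + t * (math.log(l1) - math.log(l0)))
    return None  # 50% never crossed in the observed range
```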
Should You Use Your Large Language Model to Explore or Exploit? Keegan Harris, Aleksandrs Slivkins · Jan 31, 2025 · Citations: 0
Automatic Metrics General
- We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff.
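A standard non-LLM baseline for this exploration-exploitation setting is an epsilon-greedy bandit, the kind of algorithm an LLM-assisted agent would be compared against. A minimal sketch with Bernoulli arms (the arm means, epsilon, and step count are illustrative):

```python
import random

def epsilon_greedy(arm_means, steps=10000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit: with prob. epsilon explore a random arm,
    otherwise exploit the arm with the best running mean reward."""
    rng = random.Random(seed)
    counts = [0] * len(arm_means)
    values = [0.0] * len(arm_means)  # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_means))              # explore
        else:
            arm = max(range(len(arm_means)), key=values.__getitem__)  # exploit
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
    return total / steps
```

Over enough steps the average reward approaches the best arm's mean, minus the cost of continued exploration.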