Skip to content
← Back to explorer

Tag: Tool Use

Tool Use evaluation setups appearing in the current HFEPX corpus (77 papers).

Papers in tag: 77

Running a Tool Use study?

Post a Job →

Research Utility Snapshot

Evaluation Modes

  • Automatic Metrics (13)
  • Simulation Env (1)

Human Feedback Types

  • Critique Edit (1)
  • Pairwise Preference (1)

Required Expertise

  • General (11)
  • Coding (8)
  • Law (1)
AMEL: Accumulated Message Effects on LLM Judgments

Sid-ali Temkit · May 21, 2026 · Citations: 0

Automatic Metrics Coding
  • Across 75,898 API calls to 11 models from 4 providers (OpenAI, Anthropic, Google, and four open-source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative…
  • The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.
Orchard: An Open-Source Agentic Modeling Framework

Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang · May 14, 2026 · Citations: 0

Automatic Metrics LawCoding
  • Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments.
  • We present Orchard, an open-source framework for scalable agentic modeling.
InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

Bohan Hou, Jiuning Gu, Jiayan Guo, Ronghao Dang, Sicong Leng, Xin Li · May 8, 2026 · Citations: 0

Automatic Metrics Coding
  • We introduce InterLV-Search, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search.
  • Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence…
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang, Kunyu Shi · Apr 9, 2026 · Citations: 0

Automatic Metrics General
  • The advent of agentic multimodal models has empowered systems to actively interact with external environments.
  • Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning

Yuanfu Sun, Kang Li, Dongzhe Fan, Jiajin Liu, Qiaoyu Tan · Apr 7, 2026 · Citations: 0

Automatic Metrics Coding
  • To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference.
  • Specifically, we propose AgentGL, the first reinforcement learning (RL)-driven framework for AGL.
Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency

Guan-Ting Lin, Chen Chen, Zhehuai Chen, Hung-yi Lee · Apr 6, 2026 · Citations: 0

Automatic Metrics General
  • We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use.
  • Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains.
CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing

Cheng Qian, Hyeonjeong Ha, Jiayu Liu, Jeonghwan Kim, Jiateng Liu, Bingxuan Li · Apr 6, 2026 · Citations: 0

General
  • As a first step, we introduce CreativityBench, a benchmark for evaluating affordance-based creativity in LLMs.
  • Evaluations across 10 state-of-the-art LLMs, including closed and open-source models, show that models can often select a plausible object, but fail to identify the correct parts, their affordances, and the underlying physical mechanism…
Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents

Xuan Qi · Apr 2, 2026 · Citations: 0

Automatic Metrics General
  • Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood.
  • We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0--512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark.
HippoCamp: Benchmarking Contextual Agents on Personal Computers

Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen, Yichi Zhang · Apr 1, 2026 · Citations: 0

Automatic Metrics Medicine
  • We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management.
  • We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp.
Agentic Tool Use in Large Language Models

Jinchao Hu, Meizhi Zhong, Kehai Chen, Xuefeng Bai, Min Zhang · Apr 1, 2026 · Citations: 0

General
  • Large language models are increasingly being deployed as autonomous agents yet their real world effectiveness depends on reliable tools for information retrieval, computation and external action.
  • This paper organizes the literature into three paradigms: prompting as plug-and-play, supervised tool learning and reward-driven tool policy learning, analyzes their methods, strengths and failure modes, reviews the evaluation landscape and…
Cognitive Friction: A Decision-Theoretic Framework for Bounded Deliberation in Tool-Using Agents

Davide Di Gioia · Mar 31, 2026 · Citations: 0

Automatic Metrics General
  • Autonomous tool-using agents in networked environments must decide which information source to query and when to stop querying and act.
  • Without principled bounds on information-acquisition costs, unconstrained agents exhibit systematic failure modes: excessive tool use under congestion, prolonged deliberation under time decay, and brittle behavior under ambiguous evidence.
The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle

Lara Russell-Lasalandra, Hudson Golino, Luis Eduardo Garrido, Alexander P. Christensen · Mar 30, 2026 · Citations: 0

Critique Edit Coding
  • Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin.
  • The `AIGENIE` R package implements the AI-GENIE framework (Automatic Item Generation with Network-Integrated Evaluation), which integrates large language model (LLM) text generation with network psychometric methods to automate the early…
FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol

Jie Zhu, Yimin Tian, Boyang Li, Kehao Wu, Zhongzhi Liang, Junhui Li · Mar 26, 2026 · Citations: 0

Automatic Metrics General
  • This paper introduces FinMCP-Bench, a novel benchmark for evaluating large language models (LLMs) in solving real-world financial problems through tool invocation of financial model context protocols.
  • Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities.
VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents

Yuhao Chen, Yi Xu, Xinyun Ding, Xiang Fang, Shuochen Liu, Luxi Lin · Mar 25, 2026 · Citations: 0

Pairwise Preference Simulation Env Coding
  • With the growing demand for intelligent in-vehicle experiences, vehicle-based agents are evolving from simple assistants to long-term companions.
  • To address this gap, we introduce VehicleMemBench, a multi-user long-context memory benchmark built on an executable in-vehicle simulation environment.
PaperVoyager : Building Interactive Web with Visual Language Models

Dasen Dai, Biao Wu, Meng Fang, Wenhao Wang · Mar 24, 2026 · Citations: 0

General
  • In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems.
  • To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth.
Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.