SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?

HFEPX Relevance Assessment

This paper has direct human-feedback and/or evaluation protocol signal and is likely useful for eval pipeline design.

Eval-Fit Score

25/100 • Low

Treat as adjacent context, not a core eval-method reference.

Human Feedback Signal

Not explicit in abstract metadata

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

If you are doing eval pipeline work, start here:

Human Eval Hub LLM-as-Judge Hub Pairwise Preference Hub Tool-Use Eval Hub

Protocol And Measurement Signals

Benchmarks / Datasets

No benchmark or dataset names were extracted from the available abstract.

Reported Metrics

success rate

Research Brief

Deterministic synthesis

Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting, and reusing higher-level tool… HFEPX signals include Automatic Metrics, Long Horizon with confidence 0.45. Updated from current HFEPX corpus.

Generated Mar 3, 2026, 7:29 AM · Grounded in abstract + metadata only

Key Takeaways

Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but…
However, existing benchmarks mainly measure instance-level success under static tool sets, offering limited insight into agents' ability to acquire such reusable skills.

Researcher Actions

Treat this as method context, then pivot to protocol-specific HFEPX hubs.
Identify benchmark choices from full text before operationalizing conclusions.
Validate metric comparability (success rate).

Caveats

Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
Extraction confidence is probabilistic and should be validated for critical decisions.

Recommended Queries

human-eval protocol design agent eval benchmark comparison inter-rater agreement adjudication

Research Summary

Contribution Summary

Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting, and reusing higher-level tool…
However, existing benchmarks mainly measure instance-level success under static tool sets, offering limited insight into agents' ability to acquire such reusable skills.
Evaluating state-of-the-art agents on SkillCraft, we observe substantial efficiency gains, with token usage reduced by up to 80% by skill saving and reuse.

Why It Matters For Eval

Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting, and reusing higher-level tool…
Evaluating state-of-the-art agents on SkillCraft, we observe substantial efficiency gains, with token usage reduced by up to 80% by skill saving and reuse.

Researcher Checklist

Gap: Human feedback protocol is explicit

No explicit human feedback protocol detected.
Pass: Evaluation mode is explicit

Detected: Automatic Metrics
Gap: Quality control reporting appears

No calibration/adjudication/IAA control explicitly detected.
Gap: Benchmark or dataset anchors are present

No benchmark/dataset anchor extracted from abstract.
Pass: Metric reporting is present

Detected: success rate

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.

Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions
A Geometric Analysis of Small-sized Language Model Hallucinations Protocol Overlap

Citations: 0 Relevance: 3.70 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives Protocol Overlap

Citations: 0 Relevance: 3.70 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering Protocol Overlap

Citations: 0 Relevance: 3.70 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents Protocol Overlap

Citations: 0 Relevance: 3.70 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning Protocol Overlap

Citations: 0 Relevance: 3.70 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG Protocol Overlap

Citations: 0 Relevance: 3.70 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications Protocol Overlap

Citations: 0 Relevance: 3.70 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations Protocol Overlap

Citations: 0 Relevance: 3.70 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
APEX-Agents Protocol Overlap

Citations: 0 Relevance: 3.70 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup