Matched via arXiv identifier search · Strong overlap with paper title keywords
- Stars
- 57
- Last push
- May 7, 2026 (42d ago)
Risk flags
- No tagged releases
- No Docker setup
Xirui Li, Ming Li, Derry Xu, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh, Tianyi Zhou
Paper appears method- or tooling-adjacent to AI workflows with partial ecosystem coverage.
Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The ...
pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.
No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.
Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale.
xirui-li/ClawEnvKit is the best available implementation candidate based on ranking signals, but recommendation confidence is not yet high. CI workflows are present. License is declared (MIT).
Open xirui-li/ClawEnvKitEvidence graph: 4 refs, 4 links.
Utility signals: depth 55/100, grounding 85/100, status medium.
Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.
Matched via arXiv identifier search · Strong overlap with paper title keywords
Risk flags
Open-source Environment toolkit of claw-like agents, support task/harness generation and evaluation
Ready to reproduce
Quick start
git clone https://github.com/xirui-li/ClawEnvKit.git
pip install -e . No benchmark numbers could be verified. You will not be able to validate reproduction correctness against published numbers.
No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.
Broaden model search
No trustworthy dataset matches right now.
Search datasets on Hugging FaceNo trustworthy demo spaces right now.
Search spaces on Hugging FaceTasks
Agentic tool use
Methods
Agentic systems
Domains
AI Agents
Evaluation & Human Feedback Data
Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.
Open in HFEPXExplore Similar Papers
Jump to Paper2Code search queries derived from this paper's research context.
Need human evaluators for your AI research? Scale annotation with expert AI Trainers.