Are there pretrained models available for "Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks"?

Yes, 1 Hugging Face model found. The top result is S4MPL3BI4S/gemma4-e4b-openclaw-agent-lora with 4 downloads.

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Q: How reproducible is "Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks"?

Estimated time to first reproduction: a few days. Risk flags: No repository-level reproducibility signals are currently available, Estimate assumes artifact-level reproduction; full training reproduction may require additional paper details.. Use the paper-linked Hugging Face release as the starting artifact, then reconstruct training and evaluation settings from the paper.

Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo, Hailin Hu, Lin Ma, Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei, Yunhe Wang, Yu Wang

Published: Jun 10, 2026

No direct implementation yet

Evidence: Inferred

Domain fit: AI-adjacent

Verified repos: 0

Paper appears method- or tooling-adjacent to AI workflows with partial ecosystem coverage.

Time to first repro: a few days

2 risk flags

arXiv PDF

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, ...

Read full abstract

comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only $19.1\%$ Pass@1, whereas the full adapter reaches $73.4\%$ with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw $\times$ nine-model sweep and a five-claw $\times$ two-model sweep, model choice changes Pass@1 by $29.4$ pp and harness choice by $27.4$ pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.

Technical details

Canonical key: arxiv-2606.12344

Cache status: Fresh

Generated at: Jun 20, 2026, 6:21 AM

Artifact coverage: direct

HF provider: ok (token)

PWC source used: No

LLM status: blocked (invalid_model_output:researcher_extraction:OpenRouter request failed)

LLM model: openai/gpt-5.1

LLM generated: Jun 16, 2026, 6:06 AM

LLM content type: researcher_benchmark_brief

HF policy: hf-relevance-v27

LLM evidence refs: paper.title, summary.hasReliableImplementation

context only

Benchmarks: thin evidence

Time to repro: a few days

2 risk flags

Results & Benchmarks

Freshness tier: hot

Direct + Inferred Evidence

Some benchmark signal exists in the extracted evidence, but it is not structured strongly enough yet for a confident benchmark decision.

Implementation Evidence Summary

Confidence: low

No direct maintained repository implementation was found, but paper-linked Hugging Face artifacts are available.

Reproduction Risks

Estimate assumes artifact-level reproduction; full training reproduction may require additional paper details.

Hardware Notes

Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Evidence disclosure

LLM evidence refs: paper.title, summary.hasReliableImplementation

Evidence graph: 3 refs, 2 links.

Utility signals: depth 80/100, grounding 68/100, status medium.

Implementation Status

No verified maintained repo

There is no verified maintained implementation yet. Use this baseline plan to decide whether to prototype now or defer.

Use the paper-linked Hugging Face release as the starting artifact, then reconstruct training and evaluation settings from the paper.
No direct maintained implementation was found. Use the paper PDF and citation graph to design a baseline reproduction.
Track assumptions and missing details in an experiment log before coding.

Time to first repro: a few days

Best available artifact: S4MPL3BI4S/gemma4-e4b-openclaw-agent-lora