What is the best open-source implementation of "REALM-Bench: A Real-World Planning Benchmark for LLMs and Multi-Agent Systems"?

The best-maintained implementation is genglongling/realm-bench (40 stars on GitHub). Confidence: high. Reproducibility: Limited.

What framework is used to implement "REALM-Bench: A Real-World Planning Benchmark for LLMs and Multi-Agent Systems"?

No framework is listed for the primary implementation (Framework: none).

REALM-Bench: A Real-World Planning Benchmark for LLMs and Multi-Agent Systems

How reproducible is "REALM-Bench: A Real-World Planning Benchmark for LLMs and Multi-Agent Systems"?

Estimated time to first reproduction: a few hours. Risk flags: license metadata missing, no CI workflows detected. Start with genglongling/realm-bench and validate the setup instructions in its README.

Published: Feb 1, 2025

Best maintained implementation now

Evidence: Direct

Domain fit: AI-adjacent

Verified repos: 1

Top repo stars: 40

Paper appears method- or tooling-adjacent to AI workflows with partial ecosystem coverage.

Framework: none

Time to first repro: a few hours

2 risk flags

Technical details

Canonical key: arxiv-2502.18836

Cache status: Stale (SWR served)

Generated at: Apr 29, 2026, 8:53 PM

Artifact coverage: direct

HF provider: ok (token)

PWC source used: Yes

HF policy: hf-relevance-v27

implementation starting point

Benchmarks: thin evidence

Results & Benchmarks

Freshness tier: hot

Direct + Inferred Evidence

Task: Generation. Dataset: abz07. Metric: Dynamic Gap. Value: 0.46. Source: paper fulltext.

Benchmark evidence drill-down

1 finding

Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.

Task	Dataset	Metric	Value	Source	Evidence refs
Generation	abz07	Dynamic Gap.	0.46	paper-derived	No explicit refs

REALM-Bench: A Real-World Planning Benchmark for LLMs and Multi-Agent Systems focuses on agentic tool use.

Use This Implementation Because…

Confidence: high

genglongling/realm-bench is the strongest maintained implementation based on ranking signals. Dependency/environment manifests are present.

Reproduction Risks

License metadata missing
No CI workflows detected

Evidence disclosure

Evidence graph: 3 refs, 3 links.

Utility signals: depth 95/100, grounding 85/100, status high.

Implementation Comparison

Top 2 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

genglongling/realm-bench

best maintained

Maintenance: Recently updated

Confidence: High

Reproducibility: Limited

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 40
Last push: Dec 31, 2025 (121d ago)

Dependencies

Risk flags

No CI pipeline detected
No tagged releases
No Docker setup

genglongling/SagaLLM

alternative

Maintenance: Stale

Confidence: Low

Reproducibility: Limited

Strong overlap with paper title keywords · Community adoption signal (39 stars)

Stars: 39
Last push: Mar 21, 2025 (405d ago)

Dependencies

Risk flags

No push in 12+ months
No CI pipeline detected
No tagged releases
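
The "days ago" figures and the "No push in 12+ months" flag above come down to simple date arithmetic. The following is a minimal sketch of that calculation, not the page's actual logic: the function names `days_since` and `is_stale`, the `STALE_AFTER_DAYS` variable, and the 365-day threshold are all assumptions made for illustration, and the `date -d` option assumes GNU coreutils.

```shell
# Hypothetical helpers, not taken from the page or from either repository.
# Assumes GNU date (the -d option); BSD/macOS date uses different flags.

# days_since LAST_PUSH [REFERENCE]: whole days between two dates (UTC).
# REFERENCE defaults to the current time.
days_since() {
  then_s=$(date -u -d "$1" +%s) || return 1
  if [ -n "${2:-}" ]; then
    ref_s=$(date -u -d "$2" +%s) || return 1
  else
    ref_s=$(date -u +%s)
  fi
  echo $(( (ref_s - then_s) / 86400 ))
}

# is_stale LAST_PUSH [REFERENCE]: succeeds when the last push is older
# than STALE_AFTER_DAYS (assumed default: 365, i.e. roughly 12 months).
is_stale() {
  days=$(days_since "$1" "${2:-}") || return 2
  [ "$days" -gt "${STALE_AFTER_DAYS:-365}" ]
}
```

With date-only inputs, `is_stale 2025-03-21 2026-04-29` succeeds (404 days) while `is_stale 2025-12-31 2026-04-29` does not (119 days). These differ by a day or two from the figures shown on the page, which likely reflect full timestamps or a later check time.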

Best implementation now

genglongling/realm-bench

Confidence: High

Reproducibility: Limited

REALM-Bench: A Real-World Planning Benchmark for LLMs and Multi-Agent Systems

Stars: 40

Forks: 6

Last push: Dec 31, 2025

Official implementation from Papers with Code

Repository link is mentioned in the paper metadata

Strong overlap with paper title keywords

Community adoption signal (40 stars)

License –

CI –

Deps ✓

Docker –

Selected genglongling/realm-bench as the strongest maintained implementation for new work.
Includes dependency/environment manifest signals.
Repository activity is within the last 24 months.

Reproduction readiness

Setup Required

Time to first repro: a few hours

Last checked: Apr 29, 2026

Dependencies pinned, manual setup needed

· genglongling/realm-bench has requirements.txt but requires manual environment setup.
· No Dockerfile — you will set up the environment manually.
· No CI pipeline — test coverage is unknown.
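
The readiness signals listed above can be re-checked against a local checkout. The sketch below is one way to do that under common file-naming conventions; it is not how this page computes its flags. The function name `check_repo_signals` and the specific file names it looks for (other than `requirements.txt`, which the page mentions) are assumptions.

```shell
# Hypothetical helper, not part of realm-bench: prints one line per signal
# for the repository directory given as the first argument (default: .).
check_repo_signals() {
  repo="${1:-.}"
  deps=missing; license=missing; ci=missing; docker=missing

  # Dependency manifest: the page reports a requirements.txt.
  if [ -f "$repo/requirements.txt" ]; then deps=present; fi

  # License: any top-level file whose name starts with LICENSE or COPYING.
  for f in "$repo"/LICENSE* "$repo"/COPYING*; do
    if [ -e "$f" ]; then license=present; fi
  done

  # CI: at least one GitHub Actions workflow file (assumed convention).
  for f in "$repo"/.github/workflows/*.yml "$repo"/.github/workflows/*.yaml; do
    if [ -e "$f" ]; then ci=present; fi
  done

  # Docker: a Dockerfile or a Compose file at the top level.
  for f in "$repo/Dockerfile" "$repo/docker-compose.yml" "$repo/compose.yaml"; do
    if [ -e "$f" ]; then docker=present; fi
  done

  printf 'License: %s\nCI: %s\nDeps: %s\nDocker: %s\n' \
    "$license" "$ci" "$deps" "$docker"
}
```

After cloning, `check_repo_signals realm-bench` prints the four signals in the same order as the License / CI / Deps / Docker row on this page, so any change since the "Last checked" date is easy to spot.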

Quick start

git clone https://github.com/genglongling/realm-bench.git
cd realm-bench
pip install -r requirements.txt
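
Because the page notes that the environment has to be set up manually, it may help to install into an isolated virtual environment rather than the system Python. The following is a minimal sketch under stated assumptions: the helper names `run` and `setup_realm_bench` and the `DRY_RUN` variable are invented here, `requirements.txt` is assumed to sit at the repository root (as the quick start implies) and to contain no paths relative to the working directory, and the required Python version is not stated on this page.

```shell
# Hypothetical helpers, not part of realm-bench.

# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# setup_realm_bench [DEST]: clone the repository into DEST (default:
# realm-bench), create a virtual environment inside it, and install the
# pinned dependencies with that environment's own interpreter.
setup_realm_bench() {
  dest="${1:-realm-bench}"
  run git clone https://github.com/genglongling/realm-bench.git "$dest" || return 1
  run python3 -m venv "$dest/.venv" || return 1
  run "$dest/.venv/bin/python" -m pip install -r "$dest/requirements.txt" || return 1
}
```

Running `DRY_RUN=1 setup_realm_bench` prints the three commands without executing them, which is a quick way to review the steps before touching the network. The repository's README remains the authoritative source for setup instructions.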

Additional implementations

No additional verified repositories beyond the primary recommendation.

Possible but unverified matches (1)

These repositories had low-confidence matching signals and are hidden by default.

genglongling/SagaLLM

Confidence: Low

Stars: 39

Hugging Face artifacts

No direct paper-linked artifacts were found. The strongest curated related artifacts are shown instead.

Models

No trustworthy model matches right now.

Datasets

No trustworthy dataset matches right now.

Spaces

nabeelsidd/meal-planning-agent

Curated Related

Likes: 1

Research context

Tasks

Agentic tool use

Methods

Agentic systems

Domains

AI Agents
