Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving

Q: What is the best open-source implementation of "Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving"?

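The best-maintained implementation is autonomousvision/carla_garage, with 544 stars on GitHub. The official repository, Thinklab-SJTU/Bench2Drive (1,888 stars), is listed separately as the historical official implementation. Confidence: high. Reproducibility: moderate.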

Q: How reproducible is "Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving"?

Estimated time to first reproduction: a few hours. Risk flag: no CI workflows detected. Start with autonomousvision/carla_garage and validate the setup instructions in its README.

Q: What framework is used to implement "Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving"?

The primary implementation uses PyTorch.

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, Junchi Yan

Published: Jun 6, 2024

Best maintained implementation now

Evidence: Direct

Domain fit: AI-adjacent

Verified repos: 4

Top repo stars: 544

The paper appears method- or tooling-adjacent to AI workflows, with partial ecosystem coverage.

Framework: PyTorch

Time to first repro: a few hours

1 risk flag

In an era marked by the rapid scaling of foundation models, autonomous driving technologies are approaching a transformative threshold where end-to-end autonomous driving (E2E-AD) emerges due to its potential of scaling up in the data-driven manner. However, existing E2E-AD methods are mostly evaluated under the open-loop log-replay manner with L2 errors and collision rate as metrics (e.g., in nuScenes), which could not fully reflect the driving performance of algorithms as recently acknowledged in the community. For those E2E-AD methods evaluated under the closed-loop protocol, they are tested in fixed routes (e.g., Town05Long and Longest6 in CARLA) with the driving score as metrics, which is known for high variance due to the unsmoothed metric function and large randomness in the long route. Besides, these methods usually collect their own data for training, which makes algorithm-level fair comparison infeasible. To fulfill the paramount need of comprehensive, realistic, and fair testing environments for Full Self-Driving (FSD), we present Bench2Drive, the first benchmark for evaluating E2E-AD systems' multiple abilities in a closed-loop manner. Bench2Drive's official training data consists of 2 million fully annotated frames, collected from 13638 short clips uniformly distributed under 44 interactive scenarios (cut-in, overtaking, detour, etc), 23 weathers (sunny, foggy, rainy, etc), and 12 towns (urban, village, university, etc) in CARLA v2. Its evaluation protocol requires E2E-AD models to pass 44 interactive scenarios under different locations and weathers which sums up to 220 routes and thus provides a comprehensive and disentangled assessment about their driving capability under different situations. We implement state-of-the-art E2E-AD models and evaluate them in Bench2Drive, providing insights regarding current status and future directions.
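
The figures quoted in the abstract imply a few simple averages. The sketch below works them out. It assumes the evaluation routes are split evenly across scenarios and treats "2 million frames" as exactly 2,000,000, neither of which the abstract states precisely. The variable names are illustrative only.

```python
# Figures quoted in the abstract.
NUM_SCENARIOS = 44
NUM_EVAL_ROUTES = 220
NUM_CLIPS = 13_638
NUM_FRAMES = 2_000_000  # assumption: "2 million" taken as an exact count

# Assumption: evaluation routes are spread evenly over the scenarios.
routes_per_scenario = NUM_EVAL_ROUTES // NUM_SCENARIOS
remainder = NUM_EVAL_ROUTES % NUM_SCENARIOS

# Rough averages over the training data.
avg_frames_per_clip = NUM_FRAMES / NUM_CLIPS
avg_clips_per_scenario = NUM_CLIPS / NUM_SCENARIOS

print(f"routes per scenario: {routes_per_scenario} (remainder {remainder})")
print(f"average frames per clip: {avg_frames_per_clip:.1f}")
print(f"average clips per scenario: {avg_clips_per_scenario:.1f}")
```

Under the even-split assumption, the 220 routes divide into five per scenario with none left over.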

Technical details

Canonical key: arxiv-2406.03877

Cache status: Fresh

Generated at: Jun 17, 2026, 10:42 PM

Artifact coverage: direct

PWC source used: Yes

implementation starting point

Benchmarks: missing

Results & Benchmarks

Freshness tier: cold

Direct + Inferred Evidence

No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.

Use This Implementation Because…

Confidence: high

autonomousvision/carla_garage is the strongest maintained implementation based on ranking signals. License is declared (MIT). Dependency/environment manifests are present.

Reproduction Risks

No CI workflows detected

Evidence disclosure

Evidence graph: 3 refs, 3 links.

Utility signals: depth 55/100, grounding 75/100, status medium.

Implementation Comparison

Top 3 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

autonomousvision/carla_garage

best maintained

Maintenance: Recently updated

Confidence: High

Reproducibility: Moderate

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 544
Last push: Dec 27, 2025 (172d ago)

Dependencies

Risk flags

No CI pipeline detected
No tagged releases
No Docker setup

Thinklab-SJTU/Bench2Drive

historical official

Maintenance: Active

Confidence: High

Reproducibility: Limited

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 1,888
Last push: May 22, 2026 (27d ago)

Risk flags

No CI pipeline detected
No tagged releases
No Docker setup

Thinklab-SJTU/Bench2DriveZoo

alternative

Maintenance: Stale

Confidence: High

Reproducibility: Moderate

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 392
Last push: Dec 2, 2024 (563d ago)

Dependencies

Risk flags

No push in 12+ months
No CI pipeline detected
No tagged releases
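
The "No push in 12+ months" flag and the "(Nd ago)" ages can be derived from each repository's last-push date and the date the page was checked. This is a minimal sketch, not the page's actual ranking logic. It assumes a 365-day cutoff for "12+ months" and counts whole calendar days, so its results can differ by a day from figures computed from timestamps. `days_since` and `is_stale` are hypothetical helper names.

```python
from datetime import date

STALE_AFTER_DAYS = 365  # assumption: "12+ months" approximated as 365 days


def days_since(last_push: date, checked: date) -> int:
    """Whole calendar days between the last push and the check date."""
    return (checked - last_push).days


def is_stale(last_push: date, checked: date) -> bool:
    """True when the repository has had no push for more than the cutoff."""
    return days_since(last_push, checked) > STALE_AFTER_DAYS


checked = date(2026, 6, 17)  # check date shown on this page
last_pushes = {
    "autonomousvision/carla_garage": date(2025, 12, 27),
    "Thinklab-SJTU/Bench2Drive": date(2026, 5, 22),
    "Thinklab-SJTU/Bench2DriveZoo": date(2024, 12, 2),
}
for repo, pushed in last_pushes.items():
    label = "stale" if is_stale(pushed, checked) else "recent"
    print(repo, days_since(pushed, checked), label)
```

With these dates, only Thinklab-SJTU/Bench2DriveZoo crosses the assumed cutoff, which matches the "Stale" label above.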

Best implementation now

autonomousvision/carla_garage

Confidence: High

Reproducibility: Moderate

[ICCV'23] Hidden Biases of End-to-End Driving Models & A starter kit for the CARLA leaderboard 2.0.

Stars: 544

Forks: 59

Last push: Dec 27, 2025

License: MIT

Official implementation from Papers with Code

Repository link is mentioned in the paper metadata

Partial overlap with paper title keywords

Community adoption signal (544 stars)

License ✓

CI –

Deps ✓

Docker –

Selected autonomousvision/carla_garage as the strongest maintained implementation for new work.
Includes dependency/environment manifest signals.
Repository activity is within the last 24 months.
Official repository is preserved separately as historical context.

Historical official implementation

Preserved for provenance. Not recommended as the default path for new builds.

Thinklab-SJTU/Bench2Drive

Stars: 1,888

Last push: May 22, 2026

Reproduction readiness

Setup Required

Time to first repro: a few hours

Last checked: Jun 17, 2026

Dependencies pinned, manual setup needed

· autonomousvision/carla_garage has environment.yml but requires manual environment setup.
· No Dockerfile — you will set up the environment manually.
· No CI pipeline — test coverage is unknown.

Quick start

git clone https://github.com/autonomousvision/carla_garage.git
cd carla_garage
conda env create -f environment.yml && conda activate <env-name>
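
The `<env-name>` placeholder stands for the environment name declared in `environment.yml`. As a sketch, that name can be read from the file's top-level `name:` key, which conda environment files conventionally carry. `read_env_name` is a hypothetical helper, and the sample text below is made up for illustration rather than copied from the repository.

```python
from typing import Optional


def read_env_name(yaml_text: str) -> Optional[str]:
    """Return the value of the top-level `name:` key, or None if absent."""
    for line in yaml_text.splitlines():
        # Only unindented lines are top-level keys; nested entries are skipped.
        if line.startswith("name:"):
            # Drop any trailing comment, then surrounding whitespace and quotes.
            value = line[len("name:"):].split("#", 1)[0].strip()
            return value.strip("'\"") or None
    return None


# Illustrative input only; the real file's contents may differ.
sample = (
    "name: example-env\n"
    "channels:\n"
    "  - conda-forge\n"
    "dependencies:\n"
    "  - python=3.10\n"
)
print(read_env_name(sample))
```

After cloning, the same function can be applied to the text of the repository's `environment.yml`, and the returned name passed to `conda activate`.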

No benchmark numbers could be verified. You will not be able to validate reproduction correctness against published numbers.

Additional implementations

Official

Thinklab-SJTU/Bench2DriveZoo
Confidence: High

BEVFormer, UniAD, VAD in Closed-Loop CARLA Evaluation with World Model RL Expert Think2Drive

Stars: 392

Forks: 61

Last push: Dec 2, 2024

License: NOASSERTION

Possible but unverified matches (5)

These repositories had low-confidence matching signals and are hidden by default.

RenzKa/simlingo

Confidence: Low

Stars: 420

jichengzh/trb

Confidence: Low

Stars: 0

wzh506/Bench2DriveZoo-Cot4AD

Confidence: Low

Stars: 0

Thinklab-SJTU/Bench2Drive-Jittor

Confidence: Low

Stars: 5

autodriving-heart/NeurIPS2024-Papers-about-Autonomous-Driving

Confidence: Low

Stars: 19

Hugging Face artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.

Models

No trustworthy model matches right now.

Datasets

SeanWu25/NEJM-AI_Benchmarking_Medical_Language_Models

Curated Related

Downloads: 109

Likes: 10

Updated: Nov 16, 2023

Spaces

HuggingFaceTB/smolvlm-web-benchmarking-all

Curated Related

Likes: 4

Research context

Tasks

Autonomous driving

Methods

None detected

Domains

Autonomous Driving
