
HFEPX Hub

CS.CV + Simulation Env Papers

Updated from current HFEPX corpus (Feb 27, 2026). 12 papers are grouped in this hub page. Common evaluation modes: Simulation Env, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: APPS. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 25, 2026.

Papers: 12 · Last published: Feb 25, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 12 papers for CS.CV + Simulation Env Papers. Dominant protocol signals include simulation environments and automatic metrics, with frequent benchmark focus on APPS and Vbvr-Bench, and metric focus on accuracy and success rate. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • APPS appears in 8.3% of hub papers (1/12); use this cohort for benchmark-matched comparisons.
  • Vbvr-Bench appears in 8.3% of hub papers (1/12); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 16.7% of hub papers (2/12); compare with a secondary metric before ranking methods.
  • success rate is reported in 16.7% of hub papers (2/12); compare with a secondary metric before ranking methods.
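As a concrete version of the "compare with a secondary metric before ranking" advice above, the sketch below checks whether two metrics agree on an ordering. The method names and scores are invented for illustration, not drawn from hub papers.

```python
# Hypothetical sketch: before ranking methods on a primary metric (accuracy),
# check whether a secondary metric (success rate) produces the same ordering.
# Method names and scores are invented for illustration.
methods = {
    "method_a": {"accuracy": 0.81, "success_rate": 0.62},
    "method_b": {"accuracy": 0.79, "success_rate": 0.71},
}

rank_primary = sorted(methods, key=lambda m: methods[m]["accuracy"], reverse=True)
rank_secondary = sorted(methods, key=lambda m: methods[m]["success_rate"], reverse=True)

agree = rank_primary == rank_secondary
print("primary ranking:", rank_primary)
print("rankings agree:", agree)  # False here: the two metrics flip the order
```

When the orderings disagree, a single-metric ranking is fragile; report both metrics rather than picking a winner.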

Researcher Checklist

  • Close the gap on papers with explicit human feedback: coverage is a replication risk (16.7% vs 45% target).
  • Close the gap on papers reporting quality controls: coverage is a replication risk (0% vs 30% target).
  • Close the gap on papers naming benchmarks/datasets: coverage is a replication risk (16.7% vs 35% target).
  • Maintain strength on papers naming evaluation metrics: coverage is strong (58.3% vs 35% target).
  • Close the gap on papers with known rater population: coverage is a replication risk (8.3% vs 35% target).
  • Tighten coverage of papers with known annotation unit: coverage is usable but incomplete (25% vs 35% target).
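The checklist's labels can be reproduced with a small helper. The counts and targets below come from this page (percentages are out of 12 papers); the 15-point gap threshold separating "replication risk" from "usable but incomplete" is an assumption inferred from the labels, not documented by the hub.

```python
# Sketch of the checklist's labeling rule. Counts/targets are from this hub page;
# the 0.15 gap threshold for "replication risk" is an inferred assumption.
def coverage_status(covered, total, target, risk_gap=0.15):
    coverage = covered / total
    gap = target - coverage
    if gap >= risk_gap:
        return "replication risk"
    if gap > 0:
        return "usable but incomplete"
    return "strong"

checklist = [
    ("explicit human feedback",   2, 0.45),  # 2/12 = 16.7%
    ("quality controls",          0, 0.30),
    ("benchmarks/datasets named", 2, 0.35),
    ("evaluation metrics named",  7, 0.35),  # 7/12 = 58.3%
    ("rater population known",    1, 0.35),
    ("annotation unit known",     3, 0.35),  # 3/12 = 25%
]
for name, covered, target in checklist:
    print(f"{name}: {covered/12:.1%} vs {target:.0%} -> {coverage_status(covered, 12, target)}")
```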

Suggested Reading Order

  1. Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  3. A Very Big Video Reasoning Suite

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  4. MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

    Adds simulation environments for broader coverage within this hub.

  5. UI-Venus-1.5 Technical Report

    Adds simulation environments for broader coverage within this hub.

  6. Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

    Adds simulation environments with pairwise preferences for broader coverage within this hub.

  7. Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

    Adds simulation environments for broader coverage within this hub.

  8. BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning

    Adds automatic metrics with pairwise preferences for broader coverage within this hub.

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (8.3% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

simulation_env vs automatic_metrics

Overlap: both = 3, left_only = 9, right_only = 0.

Three papers use both Simulation Env and Automatic Metrics; nine use Simulation Env only; none use Automatic Metrics only.
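The both/left_only/right_only split above is a plain set decomposition. In this sketch the paper IDs are placeholders; only the set sizes (12 total, 3 in the overlap) match the hub's numbers.

```python
# Sketch of the both/left_only/right_only split. Paper IDs are placeholders;
# only the set sizes (12 total, 3 overlap) match this hub's numbers.
simulation_env = {f"paper_{i:02d}" for i in range(12)}    # all 12 papers
automatic_metrics = {"paper_00", "paper_01", "paper_02"}  # the 3 that also report automatic metrics

both = simulation_env & automatic_metrics       # papers with both signals
left_only = simulation_env - automatic_metrics  # simulation env only
right_only = automatic_metrics - simulation_env # automatic metrics only

print(f"both={len(both)}, left_only={len(left_only)}, right_only={len(right_only)}")
# both=3, left_only=9, right_only=0
```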

Benchmark Brief

APPS

Coverage: 1 paper (8.3%) mentions APPS.

Examples: UI-Venus-1.5 Technical Report

Benchmark Brief

Vbvr-Bench

Coverage: 1 paper (8.3%) mentions Vbvr-Bench.

Examples: A Very Big Video Reasoning Suite

Benchmark Brief

Venusbench

Coverage: 1 paper (8.3%) mentions Venusbench.

Examples: UI-Venus-1.5 Technical Report

Metric Brief

success rate

Coverage: 2 papers (16.7%) mention success rate.

Examples: Self-Correcting VLA: Online Action Refinement via Sparse World Imagination, LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

Metric Brief

cost

Coverage: 1 paper (8.3%) mentions cost.

Examples: Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

Top Papers

Related Hubs