HFEPX Hub

CS.CV + Long Horizon Papers

Updated from current HFEPX corpus (Feb 27, 2026). 11 papers are grouped in this hub page. Common evaluation modes: Simulation Env, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: APPS. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 25, 2026.

Papers: 11 · Last published: Feb 25, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 11 papers for CS.CV + Long Horizon Papers. The dominant protocol signals are simulation environments and automatic metrics, with benchmark focus on APPS and MATH and metric focus on accuracy and success rate. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

18.2% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning, BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning, Self-Correcting VLA: Online Action Refinement via Sparse World Imagination, LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

Simulation environments appear in 63.6% of papers in this hub.

Evidence: Self-Correcting VLA: Online Action Refinement via Sparse World Imagination, LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies, UI-Venus-1.5 Technical Report, Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

APPS is listed as a benchmark anchor for cross-paper comparisons on this page, though it appears in only 1 of 11 hub papers (9.1%).

Evidence: UI-Venus-1.5 Technical Report, Self-Correcting VLA: Online Action Refinement via Sparse World Imagination, LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies, Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: Self-Correcting VLA: Online Action Refinement via Sparse World Imagination, LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies, Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs, Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Where reported, raters are mostly domain experts and annotation is typically at the trajectory level; use this to scope replication staffing.

Evidence: UI-Venus-1.5 Technical Report, Self-Correcting VLA: Online Action Refinement via Sparse World Imagination, LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies, Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Stratify by benchmark (APPS vs MATH) before comparing methods.

Evidence: Self-Correcting VLA: Online Action Refinement via Sparse World Imagination, LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies, Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs, Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Benchmark Interpretation

APPS appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.
MATH appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 27.3% of hub papers (3/11); compare with a secondary metric before ranking methods.
success rate is reported in 18.2% of hub papers (2/11); compare with a secondary metric before ranking methods.
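
The percentages in the two interpretation sections follow directly from mention counts over the 11 hub papers. As a minimal sketch (the dictionary and the helper name `coverage_pct` are illustrative, not part of HFEPX), the arithmetic can be reproduced like this:

```python
HUB_SIZE = 11  # total papers in this hub, as stated on the page

# Mention counts copied from the Benchmark and Metric Interpretation lines.
mention_counts = {
    "APPS": 1,
    "MATH": 1,
    "accuracy": 3,
    "success rate": 2,
}


def coverage_pct(count: int, total: int = HUB_SIZE) -> float:
    """Return the share of hub papers as a percentage, rounded to one decimal."""
    return round(100 * count / total, 1)


coverage = {name: coverage_pct(n) for name, n in mention_counts.items()}

for name, pct in coverage.items():
    print(f"{name}: {mention_counts[name]}/{HUB_SIZE} = {pct}%")
```

With cohorts this small, a single paper moves coverage by about 9 percentage points, which is worth keeping in mind before ranking methods on these figures.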

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (18.2% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (27.3% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (63.6% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (9.1% vs 35% target).
Tighten coverage on Papers with known annotation unit. Coverage is usable but incomplete (27.3% vs 35% target).
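
The checklist labels compare each coverage figure against a target. The page does not state the exact rule, so the sketch below assumes one: coverage at or above target is "strong", coverage at or above half the target is "usable but incomplete", and anything lower is a "replication risk". The 0.5 cut-off (`RISK_RATIO`) and the function name `status` are hypothetical; they were chosen only because they reproduce the six labels above.

```python
# (label, coverage %, target %) copied from the checklist above.
checklist = [
    ("explicit human feedback", 18.2, 45.0),
    ("quality controls", 0.0, 30.0),
    ("benchmarks/datasets named", 27.3, 35.0),
    ("evaluation metrics named", 63.6, 35.0),
    ("known rater population", 9.1, 35.0),
    ("known annotation unit", 27.3, 35.0),
]

RISK_RATIO = 0.5  # hypothetical cut-off; the page does not publish its rule


def status(coverage: float, target: float, risk_ratio: float = RISK_RATIO) -> str:
    """Classify a coverage figure relative to its target (assumed rule)."""
    if coverage >= target:
        return "strong"
    if coverage >= risk_ratio * target:
        return "usable but incomplete"
    return "replication risk"


statuses = {label: status(cov, tgt) for label, cov, tgt in checklist}

for label, cov, tgt in checklist:
    print(f"{label}: {cov}% vs {tgt}% target -> {statuses[label]}")
```

Any cut-off between roughly 0.41 and 0.78 of the target would give the same labels for these six rows, so the exact value should be treated as a placeholder.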

Known Limitations

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (9.1% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: APPS - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

simulation_env vs automatic_metrics

both=1, left_only=6, right_only=4

1 paper uses both Simulation Env and Automatic Metrics; 6 use only Simulation Env and 4 use only Automatic Metrics.
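
The overlap counts can be reproduced with plain set operations over the evaluation-mode tags shown in the Top Papers list. This is a sketch only: the set members are shortened paper titles typed in by hand from those tags, not an HFEPX export.

```python
# Papers tagged "Simulation Env" in the Top Papers list (titles shortened).
simulation_env = {
    "Self-Correcting VLA",
    "LiLo-VLA",
    "UI-Venus-1.5",
    "Fast-ThinkAct",
    "Aerial Vision-Language Navigation",
    "BEAT",
    "World Simulation with Video Foundation Models",
}

# Papers tagged "Automatic Metrics" in the Top Papers list (titles shortened).
automatic_metrics = {
    "Reflective Test-Time Planning",
    "Classroom Final Exam",
    "VIGiA",
    "BEAT",
    "MathScape",
}

both = simulation_env & automatic_metrics        # tagged with both modes
left_only = simulation_env - automatic_metrics   # Simulation Env only
right_only = automatic_metrics - simulation_env  # Automatic Metrics only

print(f"both={len(both)}, left_only={len(left_only)}, right_only={len(right_only)}")
# prints: both=1, left_only=6, right_only=4
```

The three groups sum to 11, so every paper in the hub carries at least one of the two tags, and BEAT is the only paper carrying both.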

Benchmark Brief

APPS

Coverage: 1 paper (9.1%) mentions APPS.

Examples: UI-Venus-1.5 Technical Report

Benchmark Brief

MATH

Coverage: 1 paper (9.1%) mentions MATH.

Examples: MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts

Benchmark Brief

Retrieval

Coverage: 1 paper (9.1%) mentions Retrieval.

Examples: VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Metric Brief

accuracy

Coverage: 3 papers (27.3%) mention accuracy.

Examples: Classroom Final Exam: An Instructor-Tested Reasoning Benchmark, VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval, BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning

Metric Brief

success rate

Coverage: 2 papers (18.2%) mention success rate.

Examples: Self-Correcting VLA: Online Action Refinement via Sparse World Imagination, LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

Metric Brief

cost

Coverage: 1 paper (9.1%) mentions cost.

Examples: Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Self-Correcting VLA: Online Action Refinement via Sparse World Imagination, LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies, Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li · Feb 25, 2026 · Citations: 0

Simulation Env Long Horizon

Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.

LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding · Feb 25, 2026 · Citations: 0

Simulation Env Long Horizon

We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.

Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidat

Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song · Feb 23, 2026 · Citations: 0

Automatic Metrics Long Horizon

We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains.

VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva, David Semedo, João Magalhães · Feb 22, 2026 · Citations: 0

Automatic Metrics Long Horizon

Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.

UI-Venus-1.5 Technical Report
Venus Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu · Feb 9, 2026 · Citations: 0

Simulation Env Long Horizon

GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging.

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026 · Citations: 0

Pairwise Preference Simulation Env Long Horizon

Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie

Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu · Dec 9, 2025 · Citations: 0

Simulation Env Long Horizon

Extensive experiments on the AerialVLN and OpenFly benchmark validate the effectiveness of our method.

BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Simulation Env Long Horizon

Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs.

World Simulation with Video Foundation Models for Physical AI
NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala · Oct 28, 2025 · Citations: 0

Simulation Env Long Horizon

These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.

MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts
Hao Liang, Linzhuang Sun, Minxuan Zhou, Zirong Chen, Meiyi Qiang · Aug 14, 2024 · Citations: 0

Automatic Metrics Long Horizon

While existing benchmarks such as MathVista and MathVerse have advanced the evaluation of multimodal math proficiency, they primarily rely on digitally rendered content and fall short in capturing the complexity of real-world scenarios.
