HFEPX Hub

CS.CV + Long Horizon Papers

Updated from current HFEPX corpus (Apr 12, 2026). 40 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 40 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: Blenderbench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 14, 2026.

Papers: 40 Last published: Mar 14, 2026 Global RSS Tag RSS

Cs.CVLong Horizon

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (40) Replication-Ready Only (2)

High-Signal Coverage

100.0%

40 / 40 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

2 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

25% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 40% of papers in this hub.
Blenderbench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Stratify by benchmark (Blenderbench vs Latentneeds-Bench) before comparing methods.

Benchmark Interpretation

Blenderbench appears in 4.2% of hub papers (1/40); use this cohort for benchmark-matched comparisons.
Latentneeds-Bench appears in 4.2% of hub papers (1/40); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 41.7% of hub papers (10/40); compare with a secondary metric before ranking methods.
latency is reported in 12.5% of hub papers (3/40); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (25% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (12.5% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (83.3% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (8.3% vs 35% target).
Strong: Papers with known annotation unit

Coverage is strong (50% vs 35% target).

Strengths

Agentic evaluation appears in 100% of papers.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.3% coverage).
Benchmark coverage is thin (12.5% of papers mention benchmarks/datasets).

Suggested Next Analyses

Stratify by benchmark (Blenderbench vs Latentneeds-Bench) before comparing methods.
Track metric sensitivity by reporting both accuracy and latency.

Recommended Queries (Expanded)

Recommended Queries

Benchmark Slice: Blenderbench Metric Slice: accuracy Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

Highest protocol score with explicit human/eval signal plus Latentneeds-Bench.

Strongest benchmark reference

Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Questio…

Reported benchmark with accuracy gives a fast comparison anchor.

Strongest recent paper

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbali…

Useful for current practice scanning; published Jan 14, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
Apr 9, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: Latentneeds Bench · Metric: Precision
Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering
Mar 14, 2026 · Citations: 0 · Score: 5.5

HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Jan 14, 2026 · Citations: 0 · Score: 5.5

HF: Pairwise Preference · Eval: Simulation Env · Benchmark: Not Reported · Metric: Latency
Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Jan 16, 2026 · Citations: 0 · Score: 5.5

HF: Not reported · Eval: Automatic Metrics · Benchmark: Blenderbench · Metric: Accuracy
Watch and Learn: Learning to Use Computers from Online Videos
Oct 6, 2025 · Citations: 0 · Score: 5.0

HF: Demonstrations · Eval: Not reported · Benchmark: OSWorld · Metric: Not Reported
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Oct 31, 2025 · Citations: 0 · Score: 5.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory Apr 9, 2026	No Not Reported	Automatic Metrics	Latentneeds Bench	Precision , Latency	Not Reported
Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering Mar 14, 2026	Yes Expert Verification	Automatic Metrics	Not Reported	Accuracy	Not Reported
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning Jan 14, 2026	Yes Pairwise Preference	Simulation Env	Not Reported	Latency	Not Reported
Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning Jan 16, 2026	No Not Reported	Automatic Metrics	Blenderbench , Slidebench	Accuracy	Not Reported
Watch and Learn: Learning to Use Computers from Online Videos Oct 6, 2025	Yes Demonstrations	Not Reported	OSWorld , Windowsagentarena	Not Reported	Not Reported
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning Oct 31, 2025	Yes Pairwise Preference	Automatic Metrics	Not Reported	Accuracy	Not Reported
SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning Mar 30, 2026	Yes Demonstrations	Simulation Env	Not Reported	Not Reported	Not Reported
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs Feb 13, 2026	Yes Pairwise Preference , Rubric Rating	Not Reported	Not Reported	Not Reported	Not Reported
Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency Apr 2, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy , F1	Not Reported
HippoCamp: Benchmarking Contextual Agents on Personal Computers Apr 1, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy	Not Reported
PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning Mar 27, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy	Not Reported
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination Feb 25, 2026	No Not Reported	Simulation Env	Not Reported	Success rate , Throughput	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	PASK: Toward Intent-Aware Proactive Agents with Lon…	Step-CoT: Stepwise Visual Chain-of-Thought for Medi…	Fast-ThinkAct: Efficient Vision-Language-Action Rea…
Human Feedback	Not reported	Expert Verification	Pairwise Preference
Evaluation Modes	Automatic Metrics	Automatic Metrics	Simulation Env
Benchmarks	Latentneeds Bench	Not reported	Not reported
Metrics	Precision, Latency	Accuracy	Latency
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Domain Experts	Unknown
Annotation Unit	Unknown	Freeform	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (3)
Demonstrations (2)
Expert Verification (1)
Rubric Rating (1)

Evaluation Modes

Automatic Metrics (16)
Simulation Env (6)

Top Benchmarks

Blenderbench (1)
Latentneeds Bench (1)
OSWorld (1)
Slidebench (1)

Top Metrics

Accuracy (10)
Latency (3)
Cost (2)
F1 (2)

Rater Population Mix

Domain Experts (2)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 15.0% · benchmarks 15.0% · metrics 52.5% · quality controls 0.0%.

Top Papers

Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering
Lin Fan, Yafei Ou, Zhipeng Deng, Pengyu Dai, Hou Chongxian · Mar 14, 2026 · Citations: 0

Expert Verification Automatic Metrics Long Horizon

Benchmark: github.com/hahaha111111/Step-CoT.
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026 · Citations: 0

Pairwise Preference Simulation Env Long Horizon

Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart · Mar 30, 2026 · Citations: 0

Demonstrations Simulation Env Long Horizon

To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL.
Watch and Learn: Learning to Use Computers from Online Videos
Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva · Oct 6, 2025 · Citations: 0

Demonstrations Long Horizon

Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data.
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

We introduce BEAT, the first framework to inject such visual backdoors into VLM-based embodied agents using objects in the environments as triggers.
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian · Feb 13, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Long Horizon

MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities.
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li · Feb 25, 2026 · Citations: 0

Simulation Env Long Horizon

Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu · Dec 9, 2025 · Citations: 0

Simulation Env Long Horizon

Extensive experiments on the AerialVLN and OpenFly benchmark validate the effectiveness of our method.
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding · Feb 25, 2026 · Citations: 0

Simulation Env Long Horizon

We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
Zhifei Xie, Zongzheng Hu, Fangda Ye, Xin Zhang, Haobo Chai · Apr 9, 2026 · Citations: 0

Automatic Metrics Long Horizon

Prior work remains largely confined to laboratory settings, leaving a clear gap in real-world proactive agent: depth, complexity, ambiguity, precision and real-time constraints.
Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Chenyang Wang, Xiuyu Li · Jan 16, 2026 · Citations: 0

Automatic Metrics Long Horizon

To address this, we introduce VIGA (Vision-as-Inverse-Graphics Agent), an interleaved multimodal reasoning framework where symbolic logic and visual perception actively cross-verify each other.
EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei · Mar 12, 2026 · Citations: 0

Automatic Metrics Long Horizon

Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.
Video-Based Reward Modeling for Computer-Use Agents
Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Gupta Rahul · Mar 10, 2026 · Citations: 0

Automatic Metrics Long Horizon

Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction.
Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization
Inha Kang, Eunki Kim, Wonjeong Ryu, Jaeyo Shin, Seungjun Yu · Nov 27, 2025 · Citations: 0

Automatic Metrics Long Horizon

To address this gap, we construct and release the real-world observations and high-resolution CMAQ-OBS dataset for East Asia, reducing regional error by 59.5% and enabling real-time 48-120 hour forecasts critical for public health alerts.
RefTr: Recurrent Refinement of Confluent Trajectories for 3D Vascular Tree Centerlines
Roman Naeem, David Hagerman, Jennifer Alvén, Fredrik Kahl · Nov 25, 2025 · Citations: 0

Automatic Metrics Long Horizon

We further introduce an efficient non-maximum suppression algorithm for spatial tree graphs to merge duplicate branches and extend evaluation metrics to be radius-aware for robust comparison.
Q$^2$: Quantization-Aware Gradient Balancing and Attention Alignment for Low-Bit Quantization
Zhaoyang Wang, Dong Wang · Nov 8, 2025 · Citations: 0

Automatic Metrics Long Horizon

Quantization-aware training (QAT) has achieved remarkable success in low-bit ($\leq$4-bit) quantization for classification networks.
LayerT2V: A Unified Multi-Layer Video Generation Framework
Guangzhao Li, Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo · Aug 6, 2025 · Citations: 0

Automatic Metrics Long Horizon

Text-to-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows.
World Simulation with Video Foundation Models for Physical AI
NVIDIA, :, Arslan Ali, Junjie Bai, Maciej Bala · Oct 28, 2025 · Citations: 0

Simulation Env Long Horizon

These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency
Payal Fofadiya, Sunil Tiwari · Apr 2, 2026 · Citations: 0

Automatic Metrics Long Horizon

Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation.
HippoCamp: Benchmarking Contextual Agents on Personal Computers
Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen · Apr 1, 2026 · Citations: 0

Automatic Metrics Tool Use

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management.
PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian · Mar 27, 2026 · Citations: 0

Automatic Metrics Long Horizon

We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning.
From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen · Mar 2, 2026 · Citations: 0

Automatic Metrics Long Horizon

While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive…
Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song · Feb 23, 2026 · Citations: 0

Automatic Metrics Long Horizon

We introduce (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains.
VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva, David Semedo, João Maglhães · Feb 22, 2026 · Citations: 0

Automatic Metrics Long Horizon

Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90\% accuracy on plan-aware VQA.
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks
Wenbo Hu, Xin Chen, Yan Gao-Tian, Yihe Deng, Nanyun Peng · Apr 9, 2026 · Citations: 0

Long Horizon

Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning
Yifei Gong, Xing Wu, Wenda Liu, Kang Tu · Apr 9, 2026 · Citations: 0

Long Horizon

We propose ToolCAD, a novel agentic CAD framework deploying LLMs as tool-using agents for text-to-CAD generation.
Your Pre-trained Diffusion Model Secretly Knows Restoration
Sudarshan Rajagopalan, Vishal M. Patel · Apr 6, 2026 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen · Mar 23, 2026 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Manifold-Aware Exploration for Reinforcement Learning in Video Generation
Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma · Mar 23, 2026 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Motion-o: Trajectory-Grounded Video Reasoning
Bishoy Galoaa, Shayda Moezzi, Xiangyu Bai, Sarah Ostadabbas · Mar 19, 2026 · Citations: 0

Long Horizon

At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning.
TrajMamba: An Ego-Motion-Guided Mamba Model for Pedestrian Trajectory Prediction from an Egocentric Perspective
Yusheng Peng, Gaofeng Zhang, Liping Zheng · Mar 16, 2026 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Seeking Physics in Diffusion Noise
Chujun Tang, Lei Zhong, Fangqiang Ding · Mar 15, 2026 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning
Chengjun Yu, Xuhan Zhu, Chaoqun Du, Pengfei Yu, Wei Zhai · Mar 10, 2026 · Citations: 0

Long Horizon

Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric…
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu · Feb 24, 2026 · Citations: 0

Long Horizon

Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions…
UI-Venus-1.5 Technical Report
Venus Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu · Feb 9, 2026 · Citations: 0

Long Horizon

In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world applications.
\textsc{NaVIDA}: Vision-Language Navigation with Inverse Dynamics Augmentation
Weiye Zhu, Zekai Zhang, Xiangchen Wang, Hewei Pan, Teng Wang · Jan 26, 2026 · Citations: 0

Long Horizon

Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and act coherently in visually rich environments.
Towards Unified World Models for Visual Navigation via Memory-Augmented Planning and Foresight
Yifei Dong, Fengyi Wu, Guangyu Chen, Lingdong Kong, Xu Zhu · Oct 9, 2025 · Citations: 0

Long Horizon

Enabling embodied agents to imagine future states is essential for robust and generalizable visual navigation.
NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions
Haolin Yang, Yuxing Long, Zhuoyuan Yu, Zihan Yang, Minghan Wang · Oct 9, 2025 · Citations: 0

Long Horizon

Prior benchmarks mainly focus on semantic understanding but overlook systematically evaluating navigation agents' spatial perception and reasoning capabilities.
Den-TP: A Density-Balanced Data Curation and Evaluation Framework for Trajectory Prediction
Ruining Yang, Yi Xu, Yun Fu, Lili Su · Sep 25, 2024 · Citations: 0

Long Horizon

However, existing datasets exhibit a strong long-tail distribution in scenario density, where common low-density cases dominate and safety-critical high-density cases are severely underrepresented.
MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts
Hao Liang, Linzhuang Sun, Minxuan Zhou, Zirong Chen, Meiyi Qiang · Aug 14, 2024 · Citations: 0

Long Horizon

While existing benchmarks such as MathVista and MathVerse have advanced the evaluation of multimodal math proficiency, they primarily rely on digitally rendered content and fall short in capturing the complexity of real-world scenarios.

Related Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now