
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Xu Tang, Yao Hu, Philip Torr, Wanli Ouyang, Shaosheng Cao · Feb 2, 2026 · Citations: 0

Data freshness

Extraction: Fresh

Check recency before relying on this page for active eval decisions. Use stale pages as context and verify against current hub results.

  • Metadata refreshed: Feb 28, 2026, 9:34 PM (Recent)
  • Extraction refreshed: Mar 13, 2026, 7:22 PM (Fresh)
  • Extraction source: Persisted extraction (confidence 0.85)
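The Fresh/Recent banding shown here can be reproduced for automated triage; a minimal sketch follows, assuming illustrative 7-day and 30-day cutoffs (the hub's real thresholds are not documented on this page).

```python
from datetime import datetime, timedelta

def freshness_band(refreshed_at: datetime, now: datetime) -> str:
    """Map a refresh timestamp to a freshness band.

    The 7-day and 30-day cutoffs are assumptions for this sketch,
    not the hub's documented thresholds.
    """
    age = now - refreshed_at
    if age <= timedelta(days=7):
        return "Fresh"
    if age <= timedelta(days=30):
        return "Recent"
    return "Stale"

# The two timestamps shown above, checked shortly after the extraction refresh.
now = datetime(2026, 3, 14)
print(freshness_band(datetime(2026, 3, 13, 19, 22), now))  # Fresh
print(freshness_band(datetime(2026, 2, 28, 21, 34), now))  # Recent
```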

Abstract

Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities remains difficult, and existing benchmarks have two major limitations. First, they are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from prior world knowledge in current MLLMs. Second, their evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench), comprising 2,000 VQA instances. All questions are created via a careful, multi-stage curation pipeline and rigorous expert review, and are designed to assess the behavior of Vision-DeepResearch systems under realistic real-world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow, which is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
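The abstract names the multi-round cropped-search workflow but does not specify it; the following is a minimal sketch of what such a loop could look like, assuming a hypothetical `image_search` retrieval function and a hypothetical `propose_crop` call into the MLLM. The interfaces, round budget, and stopping rule are illustrative assumptions, not the paper's released implementation.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), PIL convention

def cropped_search(image, question: str,
                   image_search: Callable, propose_crop: Callable,
                   max_rounds: int = 3) -> list:
    """Multi-round cropped-search sketch.

    Round 1 queries the search engine with the full image; each later
    round queries with a sub-region the model proposes, so retrieval is
    not limited to near-exact matches against the full image.
    `image_search` and `propose_crop` are assumed interfaces.
    """
    evidence: list = []
    region: Optional[Box] = None
    for _ in range(max_rounds):
        query_img = image if region is None else image.crop(region)  # PIL-style crop
        evidence.extend(image_search(query_img))
        region = propose_crop(image, question, evidence)
        if region is None:  # model judges the gathered evidence sufficient
            break
    return evidence
```

The intuition behind such a loop is that a distinctive sub-region (a sign, a logo, a product label) can retrieve pages that a cluttered full image never would, which matches the abstract's criticism of full-image near-exact matching as an idealized scenario.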

HFEPX Relevance Assessment

This paper carries strong, direct human-feedback and evaluation-protocol signals and is suitable as a primary eval-pipeline reference.

  • Best use: Primary protocol reference for eval design
  • Use if you need: A concrete protocol example with enough signal to inform rater workflow design
  • Main weakness: No major weakness surfaced
  • Trust level: High
  • Eval-Fit Score: 75/100 (High) · Use this as a primary source when designing or comparing eval protocols
  • Human Feedback Signal: Detected
  • Evaluation Signal: Detected
  • HFEPX Fit: High-confidence candidate
  • Extraction confidence: High

Field Provenance & Confidence

Each key protocol field shows extraction state, confidence band, and data source so you can decide whether to trust it directly or validate from full text.

Human Feedback Types (strong): Expert Verification

Confidence: High · Source: Persisted extraction (evidenced)

Directly usable for protocol triage.

Evidence snippet: Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding.

Evaluation Modes (strong): Automatic Metrics

Confidence: High · Source: Persisted extraction (evidenced)

Includes extracted eval setup.

Evidence snippet: Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding.

Quality Controls (strong): Adjudication

Confidence: High · Source: Persisted extraction (evidenced)

Calibration/adjudication-style controls detected.

Evidence snippet: Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding.

Benchmarks / Datasets (strong): VDR-Bench

Confidence: High · Source: Persisted extraction (evidenced)

Useful for quick benchmark comparison.

Evidence snippet: To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench) comprising 2,000 VQA instances.

Reported Metrics (missing): Not extracted

Confidence: Low · Source: Persisted extraction (missing)

No metric anchors detected.

Evidence snippet: Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding.

Rater Population (strong): Domain Experts

Confidence: High · Source: Persisted extraction (evidenced)

Helpful for staffing comparability.

Evidence snippet: All questions are created via a careful, multi-stage curation pipeline and rigorous expert review, designed to assess the behavior of Vision-DeepResearch systems under realistic real-world conditions.
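Every card above follows the same shape; below is a minimal sketch of that record as a Python dataclass, for readers who want to validate these fields programmatically. The schema is inferred from this page's layout, not taken from a published HFEPX specification.

```python
from dataclasses import dataclass

@dataclass
class ProvenanceField:
    """One extracted protocol field, as rendered in the cards above.

    Shape inferred from the page layout; not an official HFEPX schema.
    """
    name: str              # e.g. "Quality Controls"
    strength: str          # "strong" or "missing"
    value: str             # e.g. "Adjudication", or "Not extracted"
    confidence: str        # "High" or "Low"
    source: str            # e.g. "Persisted extraction (evidenced)"
    note: str              # one-line triage hint
    evidence_snippet: str  # abstract sentence backing the extraction

# Example instance mirroring the "Benchmarks / Datasets" card.
benchmarks = ProvenanceField(
    name="Benchmarks / Datasets", strength="strong", value="VDR-Bench",
    confidence="High", source="Persisted extraction (evidenced)",
    note="Useful for quick benchmark comparison.",
    evidence_snippet="To address these issues, we construct the "
                     "Vision-DeepResearch benchmark (VDR-Bench) "
                     "comprising 2,000 VQA instances.",
)
```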

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Expert Verification
  • Rater population: Domain Experts
  • Unit of annotation: Unknown
  • Expertise required: Coding
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: Web Browsing
  • Quality controls: Adjudication
  • Confidence: 0.85
  • Flags: runtime_fallback_extraction

Protocol And Measurement Signals

Benchmarks / Datasets

VDR-Bench

Reported Metrics

No metric terms were extracted from the available abstract.

Research Brief

Deterministic synthesis

Evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations. HFEPX signals include Expert Verification, Automatic Metrics, and Web Browsing, with confidence 0.85. Updated from the current HFEPX corpus.

Generated Mar 13, 2026, 7:22 PM · Grounded in abstract + metadata only

Key Takeaways

  • Evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations.
  • First, existing benchmarks are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from prior world knowledge in current MLLMs.

Researcher Actions

  • Compare its human-feedback setup against pairwise and rubric hubs.
  • Cross-check benchmark overlap: VDR-Bench.
  • Verify metric definitions before comparing against your eval pipeline.

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • Evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations.
  • First, existing benchmarks are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from prior world knowledge in current MLLMs.
  • To address the insufficient visual retrieval capabilities of current MLLMs, the authors propose a simple multi-round cropped-search workflow.

Why It Matters For Eval

  • Evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations.
  • First, existing benchmarks are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from prior world knowledge in current MLLMs.

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Expert Verification

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Pass: Quality control reporting appears

    Detected: Adjudication

  • Pass: Benchmark or dataset anchors are present

    Detected: VDR-Bench

  • Gap: Metric reporting is present

    No metric terms extracted.
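The Pass/Gap pattern in this checklist is mechanical: a criterion passes when at least one anchor was detected for it. A short sketch of that rule follows, reusing the hypothetical field records from earlier; the rendering is illustrative, not the explorer's actual code.

```python
def checklist_row(label: str, detected: list) -> str:
    """Render one checklist row: Pass if any anchor was detected, Gap otherwise."""
    if detected:
        return f"Pass: {label}\n  Detected: {', '.join(detected)}"
    return f"Gap: {label}\n  Nothing extracted."

print(checklist_row("Benchmark or dataset anchors are present", ["VDR-Bench"]))
print(checklist_row("Metric reporting is present", []))
```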

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.
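The ranking criterion is stated but not formalized; one plausible reading is a weighted sum over the three named factors, sketched below. The feature scores and weights are assumptions for illustration, not the explorer's actual scoring function.

```python
def related_paper_score(protocol_overlap: float,
                        signal_alignment: float,
                        semantic_proximity: float,
                        weights: tuple = (0.4, 0.3, 0.3)) -> float:
    """Illustrative weighted sum over the three stated ranking factors,
    each assumed to be normalized to [0, 1]. Weights are assumptions."""
    w_p, w_a, w_s = weights
    return w_p * protocol_overlap + w_a * signal_alignment + w_s * semantic_proximity

# Example: high protocol overlap can outrank a closer semantic match.
print(round(related_paper_score(0.9, 0.6, 0.4), 2))  # 0.66
print(round(related_paper_score(0.5, 0.5, 0.9), 2))  # 0.62
```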
