
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Xu Tang, Yao Hu, Philip Torr, Wanli Ouyang, Shaosheng Cao · Feb 2, 2026 · Citations: 0

Data freshness

Extraction: Fresh

Check recency before relying on this page for active eval decisions. Use stale pages as context and verify against current hub results.

  • Metadata refreshed: Feb 28, 2026, 9:34 PM (Recent)
  • Extraction refreshed: Mar 13, 2026, 7:22 PM (Fresh)
  • Extraction source: Persisted extraction (confidence 0.85)
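The Fresh/Recent banding shown here can be reproduced for automated triage; a minimal sketch follows, assuming illustrative 7-day and 30-day cutoffs (the hub's real thresholds are not documented on this page).

```python
from datetime import datetime, timedelta

def freshness_band(refreshed_at: datetime, now: datetime) -> str:
    """Map a refresh timestamp to a freshness band.

    The 7-day and 30-day cutoffs are assumptions for this sketch,
    not the hub's documented thresholds.
    """
    age = now - refreshed_at
    if age <= timedelta(days=7):
        return "Fresh"
    if age <= timedelta(days=30):
        return "Recent"
    return "Stale"

# The two timestamps shown above, checked shortly after the extraction refresh.
now = datetime(2026, 3, 14)
print(freshness_band(datetime(2026, 3, 13, 19, 22), now))  # Fresh
print(freshness_band(datetime(2026, 2, 28, 21, 34), now))  # Recent
```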

Abstract

Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities remains difficult, and existing benchmarks have two major limitations. First, they are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from prior world knowledge in current MLLMs. Second, their evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench), comprising 2,000 VQA instances. All questions are created via a careful, multi-stage curation pipeline and rigorous expert review, and are designed to assess the behavior of Vision-DeepResearch systems under realistic real-world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow, which is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
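The abstract names the multi-round cropped-search workflow but does not specify it; the following is a minimal sketch of what such a loop could look like, assuming a hypothetical `image_search` retrieval function and a hypothetical `propose_crop` call into the MLLM. The interfaces, round budget, and stopping rule are illustrative assumptions, not the paper's released implementation.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), PIL convention

def cropped_search(image, question: str,
                   image_search: Callable, propose_crop: Callable,
                   max_rounds: int = 3) -> list:
    """Multi-round cropped-search sketch.

    Round 1 queries the search engine with the full image; each later
    round queries with a sub-region the model proposes, so retrieval is
    not limited to near-exact matches against the full image.
    `image_search` and `propose_crop` are assumed interfaces.
    """
    evidence: list = []
    region: Optional[Box] = None
    for _ in range(max_rounds):
        query_img = image if region is None else image.crop(region)  # PIL-style crop
        evidence.extend(image_search(query_img))
        region = propose_crop(image, question, evidence)
        if region is None:  # model judges the gathered evidence sufficient
            break
    return evidence
```

The intuition behind such a loop is that a distinctive sub-region (a sign, a logo, a product label) can retrieve pages that a cluttered full image never would, which matches the abstract's criticism of full-image near-exact matching as an idealized scenario.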

HFEPX Relevance Assessment

This paper carries strong, direct human-feedback and evaluation-protocol signals and is suitable as a primary eval-pipeline reference.

  • Best use: Primary protocol reference for eval design
  • Use if you need: A concrete protocol example with enough signal to inform rater workflow design
  • Main weakness: No major weakness surfaced
  • Trust level: High
  • Eval-Fit Score: 75/100 (High) · Use this as a primary source when designing or comparing eval protocols
  • Human Feedback Signal: Detected
  • Evaluation Signal: Detected
  • HFEPX Fit: High-confidence candidate
  • Extraction confidence: High

Field Provenance & Confidence

Each key protocol field shows extraction state, confidence band, and data source so you can decide whether to trust it directly or validate from full text.

Human Feedback Types (strong): Expert Verification

Confidence: High · Source: Persisted extraction (evidenced)

Directly usable for protocol triage.

Evidence snippet: Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding.

Evaluation Modes (strong): Automatic Metrics

Confidence: High · Source: Persisted extraction (evidenced)

Includes extracted eval setup.

Evidence snippet: Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding.

Quality Controls (strong): Adjudication

Confidence: High · Source: Persisted extraction (evidenced)

Calibration/adjudication-style controls detected.

Evidence snippet: Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding.

Benchmarks / Datasets (strong): VDR-Bench

Confidence: High · Source: Persisted extraction (evidenced)

Useful for quick benchmark comparison.

Evidence snippet: To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench) comprising 2,000 VQA instances.

Reported Metrics (missing): Not extracted

Confidence: Low · Source: Persisted extraction (missing)

No metric anchors detected.

Evidence snippet: Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding.

Rater Population (strong): Domain Experts

Confidence: High · Source: Persisted extraction (evidenced)

Helpful for staffing comparability.

Evidence snippet: All questions are created via a careful, multi-stage curation pipeline and rigorous expert review, designed to assess the behavior of Vision-DeepResearch systems under realistic real-world conditions.
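Every card above follows the same shape; below is a minimal sketch of that record as a Python dataclass, for readers who want to validate these fields programmatically. The schema is inferred from this page's layout, not taken from a published HFEPX specification.

```python
from dataclasses import dataclass

@dataclass
class ProvenanceField:
    """One extracted protocol field, as rendered in the cards above.

    Shape inferred from the page layout; not an official HFEPX schema.
    """
    name: str              # e.g. "Quality Controls"
    strength: str          # "strong" or "missing"
    value: str             # e.g. "Adjudication", or "Not extracted"
    confidence: str        # "High" or "Low"
    source: str            # e.g. "Persisted extraction (evidenced)"
    note: str              # one-line triage hint
    evidence_snippet: str  # abstract sentence backing the extraction

# Example instance mirroring the "Benchmarks / Datasets" card.
benchmarks = ProvenanceField(
    name="Benchmarks / Datasets", strength="strong", value="VDR-Bench",
    confidence="High", source="Persisted extraction (evidenced)",
    note="Useful for quick benchmark comparison.",
    evidence_snippet="To address these issues, we construct the "
                     "Vision-DeepResearch benchmark (VDR-Bench) "
                     "comprising 2,000 VQA instances.",
)
```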

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Expert Verification
  • Rater population: Domain Experts
  • Unit of annotation: Unknown
  • Expertise required: Coding
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: Web Browsing
  • Quality controls: Adjudication
  • Confidence: 0.85
  • Flags: runtime_fallback_extraction

Protocol And Measurement Signals

Benchmarks / Datasets

VDR-Bench

Reported Metrics

No metric terms were extracted from the available abstract.

Research Brief

Deterministic synthesis

Evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations. HFEPX signals include Expert Verification, Automatic Metrics, and Web Browsing, with confidence 0.85. Updated from the current HFEPX corpus.

Generated Mar 13, 2026, 7:22 PM · Grounded in abstract + metadata only

Key Takeaways

  • Evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations.
  • First, existing benchmarks are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from prior world knowledge in current MLLMs.

Researcher Actions

  • Compare its human-feedback setup against pairwise and rubric hubs.
  • Cross-check benchmark overlap: VDR-Bench.
  • Verify metric definitions before comparing against your eval pipeline.

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • Evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations.
  • First, existing benchmarks are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from prior world knowledge in current MLLMs.
  • To address the insufficient visual retrieval capabilities of current MLLMs, the authors propose a simple multi-round cropped-search workflow.

Why It Matters For Eval

  • Evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations.
  • First, existing benchmarks are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from prior world knowledge in current MLLMs.

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Expert Verification

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Pass: Quality control reporting appears

    Detected: Adjudication

  • Pass: Benchmark or dataset anchors are present

    Detected: VDR-Bench

  • Gap: Metric reporting is present

    No metric terms extracted.
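The Pass/Gap pattern in this checklist is mechanical: a criterion passes when at least one anchor was detected for it. A short sketch of that rule follows, reusing the hypothetical field records from earlier; the rendering is illustrative, not the explorer's actual code.

```python
def checklist_row(label: str, detected: list) -> str:
    """Render one checklist row: Pass if any anchor was detected, Gap otherwise."""
    if detected:
        return f"Pass: {label}\n  Detected: {', '.join(detected)}"
    return f"Gap: {label}\n  Nothing extracted."

print(checklist_row("Benchmark or dataset anchors are present", ["VDR-Bench"]))
print(checklist_row("Metric reporting is present", []))
```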

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.
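The ranking criterion is stated but not formalized; one plausible reading is a weighted sum over the three named factors, sketched below. The feature scores and weights are assumptions for illustration, not the explorer's actual scoring function.

```python
def related_paper_score(protocol_overlap: float,
                        signal_alignment: float,
                        semantic_proximity: float,
                        weights: tuple = (0.4, 0.3, 0.3)) -> float:
    """Illustrative weighted sum over the three stated ranking factors,
    each assumed to be normalized to [0, 1]. Weights are assumptions."""
    w_p, w_a, w_s = weights
    return w_p * protocol_overlap + w_a * signal_alignment + w_s * semantic_proximity

# Example: high protocol overlap can outrank a closer semantic match.
print(round(related_paper_score(0.9, 0.6, 0.4), 2))  # 0.66
print(round(related_paper_score(0.5, 0.5, 0.9), 2))  # 0.62
```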
