LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding

Zhivar Sourati, Zheng Wang, Marianne Menglin Liu, Yazhe Hu, Mengqing Guo, Sujeeth Bharadwaj, Kyu Han, Tao Sheng, Sujith Ravi, Morteza Dehghani, Dan Roth · Oct 8, 2025 · Citations: 0

Automatic Metrics


Abstract

Question answering over visually rich documents (VRDs) requires reasoning not only over isolated content but also over documents' structural organization and cross-page dependencies. However, conventional retrieval-augmented generation (RAG) methods encode content in isolated chunks during ingestion, losing structural and cross-page dependencies, and retrieve a fixed number of pages at inference, regardless of the specific demands of the question or context. This often results in incomplete evidence retrieval and degraded answer quality for multi-page reasoning tasks. To address these limitations, we propose LAD-RAG, a novel Layout-Aware Dynamic RAG framework. During ingestion, LAD-RAG constructs a symbolic document graph that captures layout structure and cross-page dependencies, adding it alongside standard neural embeddings to yield a more holistic representation of the document. During inference, an LLM agent dynamically interacts with the neural and symbolic indices to adaptively retrieve the necessary evidence based on the query. Experiments on MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA demonstrate that LAD-RAG improves retrieval, achieving over 90% perfect recall on average without any top-k tuning, and outperforming baseline retrievers by up to 20% in recall at comparable noise levels, yielding higher QA accuracy with minimal latency.
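
The two-stage design described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Element` schema, the edge relation names, and the `DocumentIndex` class are hypothetical, a bag-of-words cosine similarity stands in for neural embeddings, and a fixed one-hop graph expansion stands in for the LLM agent that decides what to retrieve.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One layout element of a document (hypothetical schema)."""
    elem_id: str
    page: int
    kind: str  # e.g. "heading", "paragraph", "table"
    text: str


def embed(text):
    # Stand-in for a neural embedding: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class DocumentIndex:
    """Holds a vector index and a symbolic graph side by side."""

    def __init__(self, elements, edges):
        self.elements = {e.elem_id: e for e in elements}
        self.vectors = {e.elem_id: embed(e.text) for e in elements}
        # Symbolic graph: undirected adjacency built from labeled edges.
        self.neighbors = defaultdict(set)
        for src, dst, _relation in edges:
            self.neighbors[src].add(dst)
            self.neighbors[dst].add(src)

    def neural_search(self, query, threshold):
        q = embed(query)
        scored = [(cosine(q, v), eid) for eid, v in self.vectors.items()]
        return [eid for score, eid in sorted(scored, reverse=True) if score >= threshold]

    def retrieve_pages(self, query, threshold=0.3):
        # Assumption: a one-hop expansion replaces the agent's decisions.
        seeds = self.neural_search(query, threshold)
        selected = set(seeds)
        for eid in seeds:
            selected |= self.neighbors[eid]
        return sorted({self.elements[eid].page for eid in selected})


elements = [
    Element("h1", 1, "heading", "Quarterly revenue summary"),
    Element("t1", 1, "table", "Revenue by region table part one"),
    Element("t2", 2, "table", "continued rows for Asia and Europe"),
    Element("p3", 3, "paragraph", "Staff directory and office locations"),
]
edges = [
    ("h1", "t1", "section_contains"),
    ("t1", "t2", "continues_on_next_page"),
]
index = DocumentIndex(elements, edges)
print(index.retrieve_pages("revenue by region"))  # pages 1 and 2
print(index.retrieve_pages("office locations"))   # page 3 only
```

The number of pages returned follows from the score threshold and the graph links rather than from a fixed top-k. In the first query, the table continuation on page 2 shares no words with the question and is reached only through the cross-page edge.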

HFEPX Relevance Assessment

This paper appears adjacent to HFEPX scope (human-feedback/eval), but does not show strong direct protocol evidence in metadata/abstract.

Eval-Fit Score

5/100 • Low

Treat as adjacent context, not a core eval-method reference.

Human Feedback Signal

Not explicit in abstract metadata

Evaluation Signal

Detected

HFEPX Fit

Adjacent candidate

If you are doing eval pipeline work, start here:

Human Eval Hub · LLM-as-Judge Hub · Pairwise Preference Hub · Tool-Use Eval Hub

Human Data Lens

Uses human feedback: No
Feedback types: None
Rater population: Unknown
Unit of annotation: Unknown
Expertise required: General
Extraction source: Runtime deterministic fallback

Evaluation Lens

Evaluation modes: Automatic Metrics
Agentic eval: None
Quality controls: Not reported
Confidence: 0.45
Flags: low_signal, possible_false_positive, runtime_fallback_extraction

Protocol And Measurement Signals

Benchmarks / Datasets

DocVQA, Mmlongbench

Reported Metrics

accuracy, recall, latency

Research Brief

Deterministic synthesis

To address these limitations, we propose LAD-RAG, a novel Layout-Aware Dynamic RAG framework. HFEPX signals include Automatic Metrics with confidence 0.45. Updated from current HFEPX corpus.

Generated Mar 3, 2026, 8:35 PM · Grounded in abstract + metadata only

Key Takeaways

To address these limitations, we propose LAD-RAG, a novel Layout-Aware Dynamic RAG framework.
During inference, an LLM agent dynamically interacts with the neural and symbolic indices to adaptively retrieve the necessary evidence based on the query.

Researcher Actions

Treat this as method context, then pivot to protocol-specific HFEPX hubs.
Cross-check benchmark overlap: DocVQA, Mmlongbench.
Validate metric comparability (accuracy, recall, latency).
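
When checking metric comparability, it helps to pin down how recall is computed. The sketch below is one plausible reading, not the paper's definition: the abstract does not define "perfect recall" or "noise levels", so this assumes perfect recall means every gold evidence page was retrieved and uses the mean number of retrieved pages as a rough noise proxy. The function names and the input format are hypothetical.

```python
def page_recall(retrieved, gold):
    """Fraction of gold evidence pages that appear in the retrieved set."""
    gold = set(gold)
    if not gold:
        raise ValueError("each question needs at least one gold page")
    return len(gold & set(retrieved)) / len(gold)


def summarize(results):
    """results: list of (retrieved_pages, gold_pages) pairs, one per question."""
    recalls = [page_recall(retrieved, gold) for retrieved, gold in results]
    n = len(results)
    return {
        "mean_recall": sum(recalls) / n,
        # Assumption: "perfect recall" = all gold pages retrieved.
        "perfect_recall_rate": sum(r == 1.0 for r in recalls) / n,
        # Assumption: retrieved-set size as a rough proxy for noise.
        "mean_pages_retrieved": sum(len(set(r)) for r, _ in results) / n,
    }


results = [
    ([1, 2], [1, 2]),     # all gold pages found
    ([3, 4, 5], [3, 6]),  # one of two gold pages found
    ([7], [7]),           # all gold pages found
]
summary = summarize(results)
print(summary)
```

Two retrievers are only comparable on recall if they are measured at a similar retrieved-set size, which is why the summary reports both numbers together.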

Caveats

Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
Low-signal flag detected: protocol relevance may be indirect.

Recommended Queries

human-eval protocol design
pairwise preference data quality
inter-rater agreement adjudication

Research Summary

Contribution Summary

Experiments on MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA demonstrate that LAD-RAG improves retrieval, achieving over 90% perfect recall on average without any top-k tuning, and outperforming baseline retrievers by up to 20% in recall…

Why It Matters For Eval

During inference, an LLM agent dynamically interacts with the neural and symbolic indices to adaptively retrieve the necessary evidence based on the query.

Researcher Checklist

Gap: Human feedback protocol is explicit
No explicit human feedback protocol detected.

Pass: Evaluation mode is explicit
Detected: Automatic Metrics

Gap: Quality control reporting appears
No calibration/adjudication/IAA control explicitly detected.

Pass: Benchmark or dataset anchors are present
Detected: DocVQA, Mmlongbench

Pass: Metric reporting is present
Detected: accuracy, recall, latency

Category-Adjacent Papers (Broader Context)

These papers are nearby in arXiv category and useful for broader context, but not necessarily protocol-matched to this paper.

IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation (Category Neighbor)
Citations: 0 · Relevance: 5.00
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy, latency, document)

InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models (Category Neighbor)
Citations: 0 · Relevance: 4.10
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy, latency)

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning (Category Neighbor)
Citations: 0 · Relevance: 2.85
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)

Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent (Category Neighbor)
Citations: 0 · Relevance: 2.85
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)

CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era (Category Neighbor)
Citations: 0 · Relevance: 2.85
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)

Confusion-Aware Rubric Optimization for LLM-based Automated Grading (Category Neighbor)
Citations: 0 · Relevance: 2.85
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science (Category Neighbor)
Citations: 0 · Relevance: 2.85
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)

Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems (Category Neighbor)
Citations: 0 · Relevance: 2.85
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (latency)
