LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding

A layout-aware dynamic retrieval framework for complete evidence gathering over visually rich documents.

ACL 2026

Zhivar Sourati¹,², Zheng Wang¹, Marianne Menglin Liu¹, Yazhe Hu¹, Mengqing Guo¹, Sujeeth Bharadwaj¹, Kyu Han¹, Tao Sheng¹, Sujith Ravi¹, Morteza Dehghani², Dan Roth¹,³

¹Oracle AI, ²University of Southern California, ³University of Pennsylvania

Abstract

Question answering over visually rich documents requires reasoning over not only isolated content, but also layout, document structure, and cross-page dependencies. Conventional RAG pipelines encode pages or chunks independently and retrieve a fixed number of items at inference time, often missing the complete evidence needed for multi-page questions.

LAD-RAG addresses this by constructing a symbolic document graph during ingestion and storing it alongside a neural index. At inference time, an LLM agent dynamically interacts with both indices, choosing semantic search, graph-based search, or hybrid retrieval based on the query.

LAD-RAG achieves over 90% perfect recall on average without top-k tuning and improves retrieval and QA across MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA.

Why LAD-RAG?

Visually rich documents contain figures, charts, tables, captions, section boundaries, and page-to-page continuations. These cues are often critical evidence, but they are difficult to recover from a purely embedding-based, fixed top-k retriever.

Layout is evidence

Headers, sections, captions, and visual grouping determine what belongs together.

Evidence is distributed

Answers may require content scattered across figures, tables, and multiple pages.

Top-k is not enough

Different questions require different amounts and kinds of supporting evidence.

LAD-RAG complements neural indexing with symbolic document structure and dynamic retrieval.

From Chunks to Documents

Conventional RAG treats a visually rich document as a set of isolated retrieval units. LAD-RAG keeps the same semantic retrieval pathway, but adds an explicit representation of how document elements relate to each other.

Neural index: retrieves semantically similar content.
Document graph: preserves layout, sections, captions, figures, and cross-page links.
Dynamic agent: decides how much evidence to retrieve for each question.
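
The document graph described above can be pictured with a small in-memory structure. This is a minimal sketch under stated assumptions: the class names, field names, and edge-relation labels (`contains`, `captioned_by`, `continues_on`) are illustrative and are not taken from the paper's schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One page element (hypothetical schema)."""
    node_id: str
    kind: str      # e.g. "title", "paragraph", "figure", "table", "caption"
    page: int
    summary: str


@dataclass
class DocumentGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (source_id, relation, target_id)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, relation: str, target: str) -> None:
        # Reject edges that point at unknown nodes so the graph stays consistent.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((source, relation, target))

    def neighbors(self, node_id: str, relation: Optional[str] = None) -> list:
        """Nodes one hop away from node_id, optionally restricted to one relation."""
        return [
            self.nodes[t]
            for s, r, t in self.edges
            if s == node_id and (relation is None or r == relation)
        ]


graph = DocumentGraph()
graph.add_node(Node("sec1", "title", 1, "Survey methodology"))
graph.add_node(Node("fig1", "figure", 1, "Bar chart of response rates"))
graph.add_node(Node("cap1", "caption", 1, "Figure 1: response rates by group"))
graph.add_node(Node("tab1", "table", 2, "Response rates, continued"))

graph.add_edge("sec1", "contains", "fig1")        # section hierarchy
graph.add_edge("fig1", "captioned_by", "cap1")    # layout relation
graph.add_edge("fig1", "continues_on", "tab1")    # cross-page continuity

print([n.node_id for n in graph.neighbors("fig1")])  # ['cap1', 'tab1']
```

Typed edges are the point of the sketch: a retriever that lands on `fig1` can reach its caption and its continuation on the next page without relying on embedding similarity.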

Method

LAD-RAG has two phases: layout-aware ingestion and dynamic inference-time retrieval.

1. Element extraction

An LVLM parses each page into self-contained objects: titles, paragraphs, figures, tables, captions, and metadata.

2. Document graph construction

Nodes represent page elements; edges encode section hierarchy, layout relations, references, and cross-page continuity.

3. Neural-symbolic indexing

LAD-RAG stores both a symbolic graph and a neural index over node summaries for complementary retrieval paths.

4. Agentic retrieval

An LLM agent chooses semantic search, graph filtering, or contextualization depending on the question.
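
The control flow of the agentic step can be sketched as a loop that keeps calling tools until a policy decides to stop. Everything here is an assumption made for illustration: the tool functions return canned node ids, and `rule_based_policy` is a keyword heuristic standing in for the LLM agent, which in LAD-RAG makes this decision.

```python
# Hypothetical retrieval tools. Each takes the question and the evidence
# gathered so far and returns node ids; real tools would query the indices.
def semantic_search(question, evidence):
    return ["p3"]


def graph_filter(question, evidence):
    return ["fig1", "fig2", "fig3"]


def contextualize(question, evidence):
    # Expand each retrieved figure with a structurally related node.
    return [f"{node}_caption" for node in evidence if node.startswith("fig")]


TOOLS = {
    "semantic_search": semantic_search,
    "graph_filter": graph_filter,
    "contextualize": contextualize,
}


def rule_based_policy(question, evidence, history):
    """Stand-in for the LLM agent: returns the next tool name, or "stop"."""
    if not history:
        wants_everything = "all" in question.lower().split()
        return "graph_filter" if wants_everything else "semantic_search"
    if history[-1] == "graph_filter":
        return "contextualize"
    return "stop"


def retrieve(question, policy, max_steps=5):
    """Call tools chosen by the policy until it stops or max_steps is reached."""
    evidence, history = [], []
    for _ in range(max_steps):
        action = policy(question, evidence, history)
        if action == "stop":
            break
        for node in TOOLS[action](question, evidence):
            if node not in evidence:
                evidence.append(node)
        history.append(action)
    return evidence, history


for q in ("List all charts about turnout", "What was the turnout in 2020?"):
    evidence, history = retrieve(q, rule_based_policy)
    print(history, len(evidence))
# ['graph_filter', 'contextualize'] 6
# ['semantic_search'] 1
```

The two questions end with different amounts of evidence because the stopping decision belongs to the policy, not to a fixed top-k.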

How the Pipeline Works

LAD-RAG shifts the expensive document understanding step offline, then keeps inference lightweight and question-adaptive.

Ingestion: parse each page, extract elements, update memory, and build graph edges.
Storage: keep both symbolic graph structure and neural summaries.
Inference: use semantic search, graph filtering, and contextualization as needed.
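
The storage step above keeps two views of the same nodes. The sketch below shows why they complement each other, under a loud assumption: bag-of-words cosine similarity stands in for the neural encoder, so this is not a neural index, and the node records are invented for the example.

```python
import math
from collections import Counter

# Hypothetical node records: id -> (element type, summary text).
NODES = {
    "p1": ("paragraph", "revenue grew in the third quarter"),
    "f1": ("figure", "line chart of quarterly revenue"),
    "t1": ("table", "headcount by department"),
}


def embed(text: str) -> Counter:
    # Stand-in for a neural encoder: a bag-of-words term-count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Similarity index: one vector per node summary, built once at ingestion.
VECTORS = {node_id: embed(summary) for node_id, (_, summary) in NODES.items()}


def semantic_search(query: str, k: int) -> list:
    q = embed(query)
    ranked = sorted(VECTORS, key=lambda n: cosine(q, VECTORS[n]), reverse=True)
    return ranked[:k]


def symbolic_filter(kind: str) -> list:
    # Symbolic path: exact match on a structural attribute, no similarity involved.
    return [node_id for node_id, (node_kind, _) in NODES.items() if node_kind == kind]


print(semantic_search("quarterly revenue", k=2))  # ['f1', 'p1']
print(symbolic_filter("table"))                   # ['t1']
```

The similarity path needs a cutoff `k` and ranks by content overlap; the symbolic path returns every node with the requested attribute, however many there are.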

Framework overview: offline ingestion builds document representations; inference uses them adaptively.

Case Studies

LAD-RAG is designed for questions where semantic similarity alone is not enough. The agent can first search semantically, then switch to graph operations when the task requires structural evidence.

Symbolic figure retrieval

For questions asking about all charts with a property, LAD-RAG can retrieve all figure nodes first, then contextualize them with nearby layout and captions.

Cross-page continuation

For reference or appendix questions, graph contextualization recovers continuation pages that weak semantic overlap may miss.
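
Recovering continuation pages amounts to following continuation links from whatever page the first search hit. A minimal sketch, assuming continuation edges are stored as a page-to-page mapping; the mapping, page numbers, and function name are hypothetical.

```python
from collections import deque

# Hypothetical continuation edges: page -> the page that continues it.
CONTINUES_ON = {12: 13, 13: 14, 20: 21}


def expand_continuations(seed_pages):
    """Follow continuation links forward and backward from each seed page."""
    previous = {dst: src for src, dst in CONTINUES_ON.items()}
    found = set(seed_pages)
    queue = deque(seed_pages)
    while queue:
        page = queue.popleft()
        for linked in (CONTINUES_ON.get(page), previous.get(page)):
            if linked is not None and linked not in found:
                found.add(linked)
                queue.append(linked)
    return sorted(found)


# A semantic hit on page 13 alone pulls in the whole 12-14 run.
print(expand_continuations([13]))  # [12, 13, 14]
```

Pages 12 and 14 are returned even if their text shares little vocabulary with the query, which is the case a similarity-only retriever tends to miss.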

Case study: retrieving charts comparing general public with Latinos/Hispanics

Symbolic filtering and contextualization recover distributed chart evidence.

Case study: retrieving a multi-page references section

Graph contextualization recovers a multi-page references section.

Experimental Setup

Datasets

MMLongBench-Doc, LongDocURL, DUDE, MP-DocVQA

Retrieval baselines

BM25, E5-Large, BGE-Large, ColPali, RAPTOR

QA models

InternVL2-8B, Pixtral-12B, Phi-3.5-Vision, GPT-4o

Retrieval is evaluated with Perfect Recall and Irrelevant Pages Ratio; QA is evaluated with binary accuracy.
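
The two retrieval metrics can be sketched as follows. The definitions are assumptions inferred from the metric names, and the paper's exact formulation may differ: perfect recall is taken as 1 when every gold evidence page is retrieved and 0 otherwise, and the irrelevant pages ratio as the share of retrieved pages that are not gold evidence. The example runs are invented.

```python
def perfect_recall(retrieved, gold) -> float:
    """1.0 if every gold evidence page was retrieved, else 0.0 (assumed definition)."""
    return float(set(gold) <= set(retrieved))


def irrelevant_pages_ratio(retrieved, gold) -> float:
    """Share of retrieved pages that are not gold evidence (assumed definition)."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved - set(gold)) / len(retrieved)


# Invented (retrieved pages, gold pages) pairs for three questions.
runs = [
    ([1, 2, 3], [1, 2]),   # all evidence found, one extra page
    ([4, 5], [4, 6]),      # page 6 missed, one extra page
    ([7], [7]),            # exact match
]

mean_recall = sum(perfect_recall(r, g) for r, g in runs) / len(runs)
mean_irrelevant = sum(irrelevant_pages_ratio(r, g) for r, g in runs) / len(runs)

print(f"perfect recall: {mean_recall:.3f}")            # 0.667
print(f"irrelevant pages ratio: {mean_irrelevant:.3f}")  # 0.278
```

Under these definitions a retriever can raise perfect recall simply by returning more pages, which is why the two numbers are read together.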

Results

>90% perfect recall

Average retrieval completeness without top-k tuning.

up to 20% recall improvement

Higher recall at comparable irrelevant-page ratios.

within 5-8 pts of oracle evidence

End-to-end QA approaches ground-truth evidence performance.

Retrieval performance of LAD-RAG compared to baselines

Retrieval performance across datasets. Red stars show LAD-RAG without top-k tuning.

What Makes LAD-RAG Work?

Ablations show that both graph querying and contextualization improve retrieval. Graph querying contributes the largest gain, while contextualization helps recover structurally related evidence around retrieved nodes.

Method	MMLongBench	LongDocURL
LAD-RAG	0.979	0.895
LAD-RAG w/o contextualization	0.957	0.819
LAD-RAG w/o graph query	0.856	0.809
LAD-RAG w/o both	0.840	0.774
RAPTOR	0.877	0.853
ColPali	0.831	0.791

Values report the ratio of perfect recall to irrelevant page retrievals; higher is better.

End-to-End QA

Better retrieval improves downstream question answering, especially when questions require multiple evidence pages. With GPT-4o, LAD-RAG scores above retrieving @ 10 on all four benchmarks and above top-k-adjusted retrieval on three of four, narrowing the gap to oracle ground-truth evidence.

GPT-4o retrieval setting	MMLongBench	LongDocURL	DUDE	MP-DocVQA
Ground-truth evidence	0.696	0.714	0.807	0.895
Retrieving @ 10	0.610	0.622	0.706	0.819
Top-k adjusted	0.593	0.652	0.720	0.833
LAD-RAG	0.625	0.659	0.725	0.829

Scores are QA accuracy. The full paper reports results across InternVL2-8B, Pixtral-12B, Phi-3.5-Vision, and GPT-4o.
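
The remaining gap to oracle evidence can be read directly off the GPT-4o table. The short calculation below uses only the "Ground-truth evidence" and "LAD-RAG" rows; the variable names are illustrative.

```python
# GPT-4o QA accuracy, copied from the table above.
GROUND_TRUTH = {"MMLongBench": 0.696, "LongDocURL": 0.714, "DUDE": 0.807, "MP-DocVQA": 0.895}
LAD_RAG = {"MMLongBench": 0.625, "LongDocURL": 0.659, "DUDE": 0.725, "MP-DocVQA": 0.829}

# Gap in accuracy points (percentage points), rounded to one decimal.
gaps = {
    name: round((GROUND_TRUTH[name] - LAD_RAG[name]) * 100, 1)
    for name in GROUND_TRUTH
}

for name, gap in gaps.items():
    print(f"{name}: {gap} pts below ground-truth evidence")
```

For GPT-4o the gaps come out to 7.1, 5.5, 8.2, and 6.6 points, roughly the 5-8 point range quoted in the results summary.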

Latency

Graph construction happens offline during ingestion. At inference time, the retrieval agent typically uses only a small number of lightweight LLM calls over the pre-built graph and neural index.

Distribution of number of LLM calls per query

Number of agent calls per query.

Distribution of output tokens per query across LLM calls

Generated tokens per retrieval call.

LAD-RAG improves retrieval and QA while keeping inference overhead minimal, because the expensive document understanding is moved to offline ingestion.

Conclusion

LAD-RAG makes RAG document-aware.

Questions over visually rich documents (VRDs) need evidence across text, layout, figures/tables, and pages.
LAD-RAG builds a symbolic document graph alongside a neural index.
A dynamic agent retrieves evidence based on the question, not a fixed top-k.
Result: more complete retrieval, better QA, and minimal inference overhead.

Better document representation leads to better evidence, and better evidence leads to better answers.

Citation

If you find LAD-RAG useful, please consider citing our paper:

@misc{sourati2026ladrag,
  title={LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding},
  author={Zhivar Sourati and Zheng Wang and Marianne Menglin Liu and Yazhe Hu and Mengqing Guo and Sujeeth Bharadwaj and Kyu Han and Tao Sheng and Sujith Ravi and Morteza Dehghani and Dan Roth},
  year={2026},
  eprint={2510.07233},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.07233}
}