1. Introduction
The increasing volume of digital documents necessitates advanced systems capable of accurately extracting information through natural language queries. While current Document VQA systems provide textual answers, they often cannot precisely localize those answers within the document, hindering trust and verifiability. This work addresses this critical gap by proposing ARIAL, an agentic framework that integrates advanced reasoning with visual grounding to produce both accurate answers and their precise spatial localization. The models employed in this work include GPT-4, Llama 2, Vision Transformer-based components for OCR and object detection, and custom-trained localization modules.
2. Related Work
Existing research in Document VQA has explored diverse approaches, from encoder-decoder architectures leveraging multimodal transformers to methods integrating knowledge graphs for enhanced reasoning. Recent advancements in agentic AI and Large Language Models (LLMs) have opened new avenues for complex task orchestration and tool utilization in various domains. However, the specific challenge of achieving fine-grained, precise answer localization in Document VQA, especially in an interpretable agentic manner, remains largely unaddressed by prior work, which often focuses on answer generation rather than visual grounding.
3. Methodology
ARIAL operates as a multi-agent framework where an LLM orchestrates a suite of specialized tools, including an OCR tool for text extraction, a layout analysis tool for identifying structural elements, and a visual grounding tool for mapping textual answers to specific regions. Upon receiving a query and a document, the LLM agent iteratively plans and executes a series of steps: document parsing, relevant information extraction, reasoning to formulate an answer, and finally, utilizing the visual grounding tool to pinpoint the answer's exact location. This iterative process ensures robust understanding and precise localization capabilities.
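The plan-and-execute loop described above can be sketched in code. The tool interfaces and the `arial_pipeline` function below are hypothetical stand-ins, not the authors' implementation; they are a minimal illustration of how an LLM-orchestrated pipeline chains OCR, layout analysis, reasoning, and grounding, assuming each tool is a simple callable.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """Axis-aligned bounding box (pixel coordinates)."""
    x0: int
    y0: int
    x1: int
    y1: int

# Hypothetical stand-ins for ARIAL's specialized tools.
def ocr_tool(document):
    """Extract (text, region) pairs from the document image."""
    return document["tokens"]

def layout_tool(tokens):
    """Group tokens into structural elements (trivially, one block here)."""
    return [{"type": "block", "tokens": tokens}]

def grounding_tool(answer, tokens):
    """Map a textual answer back to the region of the matching token."""
    for text, region in tokens:
        if text == answer:
            return region
    return None

def arial_pipeline(document, query, llm):
    """Sketch of the iterative loop: parse, analyze, reason, ground."""
    tokens = ocr_tool(document)              # 1. document parsing
    blocks = layout_tool(tokens)             # 2. structural analysis
    answer = llm(query, blocks)              # 3. reasoning -> textual answer
    region = grounding_tool(answer, tokens)  # 4. visual grounding
    return answer, region
```

In a real agentic system the LLM would also decide *which* tool to invoke at each step and could re-plan on failure; the fixed sequence above is the simplest instance of that loop.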
4. Experimental Results
Experiments were conducted on several public benchmarks, including DocVQA and FUNSD, as well as a newly curated dataset for precise localization. ARIAL consistently outperformed state-of-the-art baselines on ANLS and F1-score for answer accuracy and on Intersection over Union (IoU) for localization precision. Notably, the largest gains appeared on localization tasks, underscoring the effectiveness of the agentic architecture and its integrated visual tools. The table below summarizes the key metrics, showing ARIAL's marked improvement on localization measures such as IoU and pixel accuracy while maintaining high ANLS scores for overall VQA accuracy.
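The localization metric used above is standard IoU: the area of overlap between the predicted and ground-truth boxes divided by the area of their union. A minimal reference implementation for axis-aligned boxes in `(x0, y0, x1, y1)` form:

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes (x0, y0, x1, y1)."""
    # Coordinates of the intersection rectangle.
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

For example, two identical boxes score 1.0, disjoint boxes score 0.0, and a prediction shifted halfway off the target scores somewhere in between; benchmark protocols typically count a localization as correct when IoU exceeds a threshold such as 0.5.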
5. Discussion
The results indicate that ARIAL's agentic framework effectively addresses the complex challenge of achieving both high-accuracy answers and precise localization in Document VQA. The LLM-driven orchestration of specialized tools allows for dynamic and context-aware processing, outperforming static end-to-end models. This work suggests a promising direction for developing more reliable, transparent, and user-friendly document AI systems, where users can not only receive answers but also verify their provenance directly within the document. Future work will explore extending ARIAL to handle even more complex document types and cross-document reasoning tasks.