1. Introduction
The increasing volume of digital documents necessitates advanced systems capable of accurately extracting information through natural language queries. While current Document VQA systems provide textual answers, they often cannot precisely localize those answers within the document, hindering trust and verifiability. This work addresses this critical gap by proposing ARIAL, an agentic framework that integrates advanced reasoning with visual grounding to produce both accurate answers and their precise spatial localization. The models employed in this work include GPT-4, Llama 2, Vision Transformer-based components for OCR and object detection, and custom-trained localization modules.
2. Related Work
Existing research in Document VQA has explored diverse approaches, from encoder-decoder architectures leveraging multimodal transformers to methods integrating knowledge graphs for enhanced reasoning. Recent advancements in agentic AI and Large Language Models (LLMs) have opened new avenues for complex task orchestration and tool utilization in various domains. However, the specific challenge of achieving fine-grained, precise answer localization in Document VQA, especially in an interpretable agentic manner, remains largely unaddressed by prior work, which often focuses on answer generation rather than visual grounding.
3. Methodology
ARIAL operates as a multi-agent framework where an LLM orchestrates a suite of specialized tools, including an OCR tool for text extraction, a layout analysis tool for identifying structural elements, and a visual grounding tool for mapping textual answers to specific regions. Upon receiving a query and a document, the LLM agent iteratively plans and executes a series of steps: document parsing, relevant information extraction, reasoning to formulate an answer, and finally, utilizing the visual grounding tool to pinpoint the answer's exact location. This iterative process ensures robust understanding and precise localization capabilities.
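The plan-and-execute loop described above can be sketched in code. The tool interfaces and the `arial_pipeline` function below are hypothetical stand-ins, not the authors' implementation; they are a minimal illustration of how an LLM-orchestrated pipeline chains OCR, layout analysis, reasoning, and grounding, assuming each tool is a simple callable.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """Axis-aligned bounding box (pixel coordinates)."""
    x0: int
    y0: int
    x1: int
    y1: int

# Hypothetical stand-ins for ARIAL's specialized tools.
def ocr_tool(document):
    """Extract (text, region) pairs from the document image."""
    return document["tokens"]

def layout_tool(tokens):
    """Group tokens into structural elements (trivially, one block here)."""
    return [{"type": "block", "tokens": tokens}]

def grounding_tool(answer, tokens):
    """Map a textual answer back to the region of the matching token."""
    for text, region in tokens:
        if text == answer:
            return region
    return None

def arial_pipeline(document, query, llm):
    """Sketch of the iterative loop: parse, analyze, reason, ground."""
    tokens = ocr_tool(document)              # 1. document parsing
    blocks = layout_tool(tokens)             # 2. structural analysis
    answer = llm(query, blocks)              # 3. reasoning -> textual answer
    region = grounding_tool(answer, tokens)  # 4. visual grounding
    return answer, region
```

In a real agentic system the LLM would also decide *which* tool to invoke at each step and could re-plan on failure; the fixed sequence above is the simplest instance of that loop.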
4. Experimental Results
Experiments were conducted on several public benchmarks, including DocVQA and FUNSD, as well as a newly curated dataset for precise localization. ARIAL consistently outperformed state-of-the-art baselines on ANLS and F1-score for answer accuracy and on Intersection over Union (IoU) for localization precision. Notably, the largest gains appeared on localization tasks, underscoring the effectiveness of the agentic architecture and its integrated visual tools. The table below summarizes the key metrics, showing ARIAL's marked improvement on localization measures such as IoU and pixel accuracy while maintaining high ANLS scores for overall VQA accuracy.
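The localization metric used above is standard IoU: the area of overlap between the predicted and ground-truth boxes divided by the area of their union. A minimal reference implementation for axis-aligned boxes in `(x0, y0, x1, y1)` form:

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes (x0, y0, x1, y1)."""
    # Coordinates of the intersection rectangle.
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

For example, two identical boxes score 1.0, disjoint boxes score 0.0, and a prediction shifted halfway off the target scores somewhere in between; benchmark protocols typically count a localization as correct when IoU exceeds a threshold such as 0.5.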
5. Discussion
The results indicate that ARIAL's agentic framework effectively addresses the complex challenge of achieving both high-accuracy answers and precise localization in Document VQA. The LLM-driven orchestration of specialized tools allows for dynamic and context-aware processing, outperforming static end-to-end models. This work suggests a promising direction for developing more reliable, transparent, and user-friendly document AI systems, where users can not only receive answers but also verify their provenance directly within the document. Future work will explore extending ARIAL to handle even more complex document types and cross-document reasoning tasks.