1. Introduction
Visual Question Answering (VQA) systems aim to understand both visual content and natural language questions in order to produce accurate answers, and they typically rely on direct image input. Using image captions in place of raw images could offer computational and accessibility benefits, provided the captions carry sufficient information. This study examines whether current image captions are semantically rich enough to stand in for images in complex VQA scenarios, and quantifies the information gap between images and their textual descriptions for reasoning tasks. Our experiments cover representative VQA models (LXMERT, ViLBERT, M4C) and image captioning models (Show and Tell and Transformer-based captioners).
2. Related Work
Previous research in VQA has primarily focused on improving multimodal fusion techniques, usually assuming direct access to visual features extracted from images. Concurrently, advances in image captioning have produced highly descriptive textual outputs, but their information density for specific downstream tasks remains underexplored. Studies of caption quality typically evaluate fluency and relevance to the image, not sufficiency for complex reasoning tasks such as VQA. This work differentiates itself by directly evaluating caption utility for question answering and contrasting it with the performance achieved using raw visual input.
3. Methodology
We developed the CaptionQA framework, which consists of two primary stages: caption generation and caption-based VQA. First, a diverse set of images from established VQA datasets is processed through state-of-the-art image captioning models to generate descriptive texts. Next, a VQA model is trained and evaluated using only these generated captions and the corresponding questions, without access to the original images. The performance on caption-based VQA is then directly compared to the baseline performance of the same VQA model operating on raw image features to quantify the 'caption utility gap'.
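The sketch below illustrates the two-stage CaptionQA loop under simplifying assumptions: the Hugging Face pipelines and checkpoints shown (a ViT-GPT2 captioner and an extractive RoBERTa reader) are illustrative stand-ins for the captioning and VQA models evaluated in this study, and exact-match scoring stands in for the datasets' official metrics.

```python
# Minimal sketch of the two-stage CaptionQA evaluation loop (illustrative only).
from transformers import pipeline

# Stage 1: caption generation. The checkpoint is an illustrative choice,
# not one of the captioners evaluated in this study.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# Stage 2: caption-only question answering. An extractive text-QA reader stands
# in for the caption-conditioned VQA model; it never sees the image.
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

def caption_only_answer(image_path: str, question: str) -> str:
    """Answer a question using only the generated caption, never the image."""
    caption = captioner(image_path)[0]["generated_text"]
    return reader(question=question, context=caption)["answer"]

def exact_match_accuracy(predictions, references) -> float:
    """Simple exact-match scoring; the reported results use each dataset's own metric."""
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)
```

In practice the caption-based VQA model is trained on caption-question pairs rather than used zero-shot; the sketch only fixes the interface constraint that the answering model never receives the image.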
4. Experimental Results
Our experiments reveal a consistent and substantial performance drop when VQA models are constrained to image captions rather than full image features. On the VQAv2 dataset, for instance, caption-only VQA achieved 45.2% accuracy, whereas image-based VQA reached 68.7%. The disparity was more pronounced for questions requiring fine-grained object recognition or spatial reasoning, indicating that captions omit much of the specific detail these questions depend on.
Table 1: VQA Accuracy Comparison (Image vs. Caption Input)
| Dataset | VQA Model | Image Input Accuracy (%) | Caption Input Accuracy (%) | Gap (points) |
|---|---|---|---|---|
| VQAv2 | LXMERT | 68.7 | 45.2 | 23.5 |
| GQA | ViLBERT | 72.1 | 48.9 | 23.2 |
| OKVQA | M4C | 55.3 | 32.8 | 22.5 |
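As a minimal illustration, the gap column in Table 1 is simply the per-dataset difference in accuracy points between the two input conditions (values taken from the table above):

```python
# Caption utility gap from Table 1: accuracy drop (in points) when the same
# VQA model receives generated captions instead of image features.
results = {
    "VQAv2 / LXMERT": {"image": 68.7, "caption": 45.2},
    "GQA / ViLBERT":  {"image": 72.1, "caption": 48.9},
    "OKVQA / M4C":    {"image": 55.3, "caption": 32.8},
}

for setting, acc in results.items():
    gap = acc["image"] - acc["caption"]
    print(f"{setting}: caption utility gap = {gap:.1f} points")
```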
5. Discussion
The observed performance gap between image-based and caption-based VQA shows that current image captions, while descriptive, often fail to capture the nuanced visual information essential for complex reasoning. This suggests a need for captioning models that generate more detailed, context-aware, or task-specific descriptions rather than generic summaries. Future work could incorporate explicit object relationships, spatial attributes, or visual common sense into caption generation, with the ultimate goal of making captions as informative for downstream tasks such as VQA as the images themselves.