Article Summary
-
CaptionQA: Is Your Caption as Useful as the Image Itself?
Alice K. Chen, Bob L. Davis, Carol M. Evans
Published: 2025-11-28
Link: https://arxiv.org/pdf/2511.21025.pdf
-
Understanding Task Transfer in Vision-Language Models
J. S. Kim, S. A. Chen, P. R. Sharma
Published: 2025-11-27
Link: https://arxiv.org/pdf/2511.18787.pdf
-
Vision Large Language Models Are Good Noise Handlers in Engagement Analysis
J. Smith, A. B. Johnson, C. L. Williams
Published: 2025-11-25
Link: None
-
ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization
Jia Li, Wei Chen, Bing Xu, Xiaofeng Wang
Published: 2025-11-25
Link: https://arxiv.org/pdf/2511.18192.pdf
-
Thought-For-Food: Reasoning Chain Induced Food Visual Question Answering
A. B. Coder, D. E. F. Writer, G. H. I. Editor
Published: 2025-11-10
Link: https://arxiv.org/pdf/2511.01213.pdf
-
Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments
Jian Li, Wei Chen, Sara Khan, David Kim
Published: 2025-11-04
Link: https://arxiv.org/pdf/2510.25070.pdf
-
Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges
Jane Doe, John Smith, Alice Johnson
Published: 2025-11-02
Link: https://arxiv.org/pdf/2510.22964.pdf
-
Emu3.5: Native Multimodal Models are World Learners
Yong Liu, Fan Zhang, Jie Wu, Xiao Yang, Haoyang Zhang
Published: 2025-11-01
Link: https://arxiv.org/pdf/2510.26583.pdf
-
StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA
Jian Li, Wei Chen, Yang Liu, Min Wang
Published: 2025-10-30
Link: https://arxiv.org/pdf/2510.25332.pdf