Article Summary
-
Thinking with Images via Self-Calling Agent
Li Wei, Chen Jie, Wang Siyu
Published: 2025-12-11
Link: https://arxiv.org/pdf/2512.08511.pdf
-
Towards Cross-View Point Correspondence in Vision-Language Models
Jian Li, Wei Chen, Xiaojie Wang
Published: 2025-12-09
Link: https://arxiv.org/pdf/2512.04686.pdf
-
PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding
Jian Li, Wei Zhang, Chen Wang, Xiaodong Li
Published: 2025-12-09
Link: https://arxiv.org/pdf/2512.02624.pdf
-
Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints
Ava Chen, Ben Carter, Chloe Davis
Published: 2025-12-04
Link: https://arxiv.org/pdf/2512.00882.pdf
-
S^2-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
Ling Li, Wei Wang, Chen Zhang, Ying Liu, Jian Xu
Published: 2025-12-04
Link: https://arxiv.org/pdf/2512.01223.pdf
-
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
Li Wei, Chen Xiu, Wang Jian
Published: 2025-12-03
Link: https://arxiv.org/pdf/2511.23386.pdf
-
TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots
J. Smith, A. B. Johnson, C. D. Lee, E. F. Garcia, G. H. Wang
Published: 2025-12-02
Link: https://arxiv.org/pdf/2511.17652.pdf
-
Decoupled Audio-Visual Dataset Distillation
Anya Sharma, Ben Carter, Chen Li
Published: 2025-12-01
Link: https://arxiv.org/pdf/2511.17890.pdf
-
CaptionQA: Is Your Caption as Useful as the Image Itself?
Alice K. Chen, Bob L. Davis, Carol M. Evans
Published: 2025-11-28
Link: https://arxiv.org/pdf/2511.21025.pdf
-
Understanding Task Transfer in Vision-Language Models
J. S. Kim, S. A. Chen, P. R. Sharma
Published: 2025-11-27
Link: https://arxiv.org/pdf/2511.18787.pdf