Article Summary

Showing results for: Benchmarking

Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights
Authors: Alice Chen, Bob Davis, Carol White, David Green
Keywords: Causal Reasoning Benchmarking Unified Understanding AI Generation World Models
Published: 2025-12-06 Link: https://arxiv.org/pdf/2512.01816.pdf

RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios
Authors: Jian Zhang, Li Wang, Wei Chen
Keywords: Multimodal Large Language Models MLLMs Spatial Understanding Reasoning Urban Scenarios Benchmarking Autonomous Driving
Published: 2025-11-29 Link: https://arxiv.org/pdf/2511.18011.pdf

AttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision-Language-Action Models
Authors: Ava Chen, Benjamin Lee, Chloe Kim, Daniel Wang
Keywords: Vision-Language-Action Models Adversarial Attacks Backdoor Attacks Model Robustness Benchmarking
Published: 2025-11-18 Link: https://arxiv.org/pdf/2511.12149.pdf

STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models
Authors: Alice Chen, Bob Davis, Carla Evans, David Foster
Keywords: Vision-Language Models Object State Understanding Benchmarking AI Evaluation Reasoning
Published: 2025-11-02 Link: https://arxiv.org/pdf/2510.22571.pdf

Evaluation of Vision-LLMs in Surveillance Video
Authors: Jian Li, Wei Chen, Mei Lin
Keywords: Vision-LLMs Surveillance Video Video Analytics Object Recognition Benchmarking
Published: 2025-10-30 Link: https://arxiv.org/pdf/2510.23190.pdf

OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
Authors: John Doe, Jane Smith, Robert Johnson
Keywords: OSWorld Multimodal Computer Perception Benchmarking Tool Invocation Computer-Use Agents Large Language Models Vision-Language Models
Published: 2025-10-29 Link: https://arxiv.org/pdf/2510.24563.pdf

How to Evaluate Monocular Depth Estimation?
Authors: Jane Doe, John Smith, Alice Johnson
Keywords: Monocular Depth Estimation Evaluation Metrics Benchmarking Computer Vision 3D Reconstruction
Published: 2025-10-24 Link: https://arxiv.org/pdf/2510.19814.pdf

DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
Authors: Jia Li, Wei Wang, Min Chen
Keywords: Dialect robustness Multimodal generation Benchmarking Large language models Speech synthesis
Published: 2025-10-23 Link: https://arxiv.org/pdf/2510.14949.pdf

PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
Authors: Jian Li, Wei Chen, Xin Wang
Keywords: MLLMs Physical Reasoning Tool Understanding Benchmarking Multimodal AI
Published: 2025-10-19 Link: https://arxiv.org/pdf/2510.09507.pdf

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
Authors: Jane Doe, John Smith, Alice Wonderland
Keywords: Vision-Language-Action Models Robustness Analysis Robotics Adversarial Perturbations Benchmarking
Published: 2025-10-18 Link: https://arxiv.org/pdf/2510.13626.pdf