| Describe Anything Model for Visual Question Answering on Text-rich Images | Jul 16, 2025 | Descriptive, Language Modeling | Code Available | 1 |
| Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights | Jul 9, 2025 | Diagnostic, Medical Visual Question Answering | Unverified | 0 |
| MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning | Jul 9, 2025 | Diagnostic, Multimodal Reasoning | Unverified | 0 |
| Evaluating Attribute Confusion in Fashion Text-to-Image Generation | Jul 9, 2025 | Attribute, cross-modal alignment | Unverified | 0 |
| LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation | Jul 9, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling | Jul 8, 2025 | Articles, Multimodal Reasoning | Unverified | 0 |
| ReLoop: "Seeing Twice and Thinking Backwards" via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding | Jul 7, 2025 | Hallucination, Question Answering | Unverified | 0 |
| Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models | Jun 28, 2025 | Image Classification | Code Available | 0 |
| SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning | Jun 26, 2025 | In-Context Learning, Medical Visual Question Answering | Unverified | 0 |
| DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images | Jun 26, 2025 | document understanding, Optical Character Recognition (OCR) | Code Available | 0 |
| FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering | Jun 25, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction | Jun 25, 2025 | Benchmarking, Person Identification | Code Available | 0 |
| Semantic-enhanced Modality-asymmetric Retrieval for Online E-commerce Search | Jun 25, 2025 | Question Answering, Retrieval | Unverified | 0 |
| GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning | Jun 22, 2025 | Answer Generation, Decision Making | Unverified | 0 |
| Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations | Jun 21, 2025 | Question Answering, Scene Understanding | Unverified | 0 |
| Can Common VLMs Rival Medical VLMs? Evaluation and Strategic Insights | Jun 19, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering | Jun 18, 2025 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| Adapting Lightweight Vision Language Models for Radiological Visual Question Answering | Jun 17, 2025 | Diagnostic, Question Answering | Code Available | 0 |
| SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement | Jun 16, 2025 | document understanding, Question Answering | Code Available | 1 |
| CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making | Jun 15, 2025 | Answer Generation, Decision Making | Unverified | 0 |
| AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making | Jun 14, 2025 | Decision Making, Question Answering | Unverified | 0 |
| MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space | Jun 13, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| A Fast, Reliable, and Secure Programming Language for LLM Agents with Code Actions | Jun 13, 2025 | Conformal Prediction, Question Answering | Unverified | 0 |
| SlotPi: Physics-informed Object-centric Reasoning Models | Jun 12, 2025 | Object, Question Answering | Code Available | 0 |
| HalLoc: Token-level Localization of Hallucinations for Vision Language Models | Jun 12, 2025 | Hallucination, Image Captioning | Code Available | 0 |