| HAMMR: HierArchical MultiModal React agents for generic VQA | Apr 8, 2024 | Optical Character Recognition (OCR), Question Answering | Unverified | 0 |
| Challenges Faced by Large Language Models in Solving Multi-Agent Flocking | Apr 6, 2024 | Decision Making, Spatial Reasoning | Unverified | 0 |
| Grounding Spatial Relations in Text-Only Language Models | Mar 20, 2024 | Spatial Reasoning | Code Available | 0 |
| SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors | Mar 18, 2024 | Hallucination, Motion Planning | Unverified | 0 |
| JSTR: Joint Spatio-Temporal Reasoning for Event-based Moving Object Detection | Mar 12, 2024 | Motion Compensation, Moving Object Detection | Unverified | 0 |
| DivCon: Divide and Conquer for Progressive Text-to-Image Generation | Mar 11, 2024 | Image Generation, Layout-to-Image Generation | Unverified | 0 |
| Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training | Mar 4, 2024 | Math, Phrase Grounding | Unverified | 0 |
| A Surprising Failure? Multimodal LLMs and the NLVR Challenge | Feb 26, 2024 | Sentence, Spatial Reasoning | Unverified | 0 |
| DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | Feb 19, 2024 | Autonomous Driving, Scene Understanding | Unverified | 0 |
| PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs | Feb 12, 2024 | Instruction Following, Logical Reasoning | Unverified | 0 |