| Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding | Mar 21, 2023 | Knowledge Probing, Language Modelling | Code Available | 1 |
| Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models | Aug 9, 2021 | Composed Image Retrieval (CoIR), Image Retrieval | Code Available | 1 |
| Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | Mar 27, 2024 | Representation Learning, Visual Question Answering | Code Available | 1 |
| UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers | Jan 31, 2023 | Image Captioning, Image Classification | Code Available | 1 |
| Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs | May 24, 2024 | Hallucination, Response Generation | Code Available | 1 |
| LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models | Jul 23, 2024 | Multimodal Reasoning, Prompt Engineering | Code Available | 1 |
| Interpreting and Controlling Vision Foundation Models via Text Explanations | Oct 16, 2023 | Model Editing, Visual Reasoning | Code Available | 1 |
| Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models | Mar 19, 2024 | image-classification, Image Classification | Code Available | 1 |
| FiLM: Visual Reasoning with a General Conditioning Layer | Sep 22, 2017 | Image Retrieval with Multi-Modal Query, Visual Question Answering (VQA) | Code Available | 1 |
| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Sep 23, 2024 | Image Comprehension, Referring Expression | Code Available | 1 |
| FLAVA: A Foundational Language And Vision Alignment Model | Dec 8, 2021 | Image Retrieval, Image-to-Text Retrieval | Code Available | 1 |
| ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension | Jun 17, 2024 | Decoder, Visual Reasoning | Code Available | 1 |
| Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Feb 21, 2025 | Math, Mathematical Problem-Solving | Code Available | 1 |
| Forward Prediction for Physical Reasoning | Jun 18, 2020 | Prediction, Visual Reasoning | Code Available | 1 |
| ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models | Jan 24, 2024 | Visual Reasoning | Code Available | 1 |
| Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training | Jan 1, 2023 | 3D dense captioning, 3D visual grounding | Code Available | 1 |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Dec 5, 2023 | Benchmarking, Visual Question Answering | Code Available | 1 |
| Visually Grounded Reasoning across Languages and Cultures | Sep 28, 2021 | Cross-Lingual Transfer, Visual Reasoning | Code Available | 1 |
| From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Sep 27, 2024 | Video Understanding, Visual Reasoning | Code Available | 1 |
| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal Retrieval, Decision Making | Code Available | 1 |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | Nov 27, 2023 | Language Modelling, Large Language Model | Code Available | 1 |
| Visual Semantic Reasoning for Image-Text Matching | Sep 6, 2019 | Cross-Modal Retrieval, Image Retrieval | Code Available | 1 |
| How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Nov 27, 2023 | Adversarial Robustness, Visual Question Answering (VQA) | Code Available | 1 |
| KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models | Jul 25, 2024 | Visual Analogies, Visual Reasoning | Code Available | 1 |
| CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Apr 5, 2022 | Explanation Generation, Question Answering | Code Available | 1 |
| Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing | Sep 26, 2024 | Event Detection, Object | Unverified | 0 |
| Comparison Visual Instruction Tuning | Jun 13, 2024 | Instruction Following, Novelty Detection | Unverified | 0 |
| Comparing Visual Reasoning in Humans and AI | Apr 29, 2021 | Sentence, Visual Reasoning | Unverified | 0 |
| Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases | Apr 16, 2024 | Autonomous Driving, Visual Reasoning | Unverified | 0 |
| Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation | Jan 30, 2025 | Memorization, Scene Understanding | Unverified | 0 |
| Guiding Visual Question Answering with Attention Priors | May 25, 2022 | Question Answering, Visual Grounding | Unverified | 0 |
| Automated 3D Physical Simulation of Open-world Scene with Gaussian Splatting | Nov 19, 2024 | 3D Generation, GPU | Unverified | 0 |
| Advancing Generalization Across a Variety of Abstract Visual Reasoning Tasks | May 19, 2025 | Visual Reasoning | Unverified | 0 |
| Grounding Physical Object and Event Concepts Through Dynamic Visual Reasoning | Jan 1, 2021 | counterfactual, Object | Unverified | 0 |
| Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning | Mar 30, 2021 | counterfactual, Object | Unverified | 0 |
| A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs | Jan 23, 2025 | Descriptive, Diagnostic | Unverified | 0 |
| A Unified View of Abstract Visual Reasoning Problems | Jun 16, 2024 | Transfer Learning, Visual Reasoning | Unverified | 0 |
| Abstract Visual Reasoning with Tangram Shapes | Nov 29, 2022 | Visual Reasoning | Unverified | 0 |
| Grounded Reinforcement Learning for Visual Reasoning | May 29, 2025 | reinforcement-learning, Reinforcement Learning | Unverified | 0 |
| GRIT: Teaching MLLMs to Think with Images | May 21, 2025 | Reinforcement Learning (RL), Visual Reasoning | Unverified | 0 |
| Graph Representation for Order-Aware Visual Transformation | Jan 1, 2023 | Visual Reasoning | Unverified | 0 |
| Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning | May 26, 2025 | reinforcement-learning, Reinforcement Learning | Unverified | 0 |
| GSON: A Group-based Social Navigation Framework with Large Multimodal Model | Sep 26, 2024 | Autonomous Vehicles, Motion Planning | Unverified | 0 |
| GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs | Jun 19, 2024 | Spatial Reasoning, Visual Reasoning | Unverified | 0 |
| Code Repair with LLMs gives an Exploration-Exploitation Tradeoff | May 26, 2024 | Code Repair, Language Modeling | Unverified | 0 |
| Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning | Jun 8, 2025 | Attribute, Hallucination | Unverified | 0 |
| Attention on Abstract Visual Reasoning | Nov 14, 2019 | Program induction, Relation | Unverified | 0 |
| HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation | Jun 26, 2025 | counterfactual, Counterfactual Reasoning | Unverified | 0 |
| HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | Jun 1, 2024 | Action Recognition, Activity Recognition | Unverified | 0 |
| Grammar-Based Grounded Lexicon Learning | Feb 17, 2022 | Network Embedding, Sentence | Unverified | 0 |