| Compositional Image-Text Matching and Retrieval by Grounding Entities | May 4, 2025 | Image Captioning, Image-text matching | Code Available | 0 |
| Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing | Jan 15, 2025 | Visual Commonsense Reasoning | Unverified | 0 |
| How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey | Dec 11, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | Dec 8, 2024 | Misconceptions, Multiple-choice | Code Available | 0 |
| Improving Visual Commonsense in Language Models via Multiple Image Generation | Jun 19, 2024 | Common Sense Reasoning, Image Generation | Code Available | 1 |
| Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? | Jun 11, 2024 | Adversarial Text, Image Generation | Unverified | 0 |
| ALGO: Object-Grounded Visual Commonsense Reasoning for Open-World Egocentric Action Recognition | Jun 9, 2024 | Action Recognition, Object Recognition | Unverified | 0 |
| Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | Jun 3, 2024 | Image Captioning, Language Modelling | Code Available | 2 |
| Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR | May 27, 2024 | Question Answering, TAG | Unverified | 0 |
| EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning | Apr 22, 2024 | Visual Commonsense Reasoning | Unverified | 0 |
| ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts | Dec 1, 2023 | Visual Commonsense Reasoning, Visual Prompting | Code Available | 0 |
| Improving Vision-and-Language Reasoning via Spatial Relations Modeling | Nov 9, 2023 | Position regression, Relation | Unverified | 0 |
| ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models | Oct 9, 2023 | Image Captioning, Visual Commonsense Reasoning | Unverified | 0 |
| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal Retrieval, Decision Making | Code Available | 1 |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | Jul 7, 2023 | Attribute, Common Sense Reasoning | Code Available | 2 |
| Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning | May 26, 2023 | Object Recognition, Visual Commonsense Reasoning | Unverified | 0 |
| GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions | May 24, 2023 | Object, Question Answering | Unverified | 0 |
| CAVL: Learning Contrastive and Adaptive Representations of Vision and Language | Apr 10, 2023 | Image Retrieval, Phrase Grounding | Unverified | 0 |
| Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | Mar 13, 2023 | Common Sense Reasoning, Explanation Generation | Unverified | 0 |
| Learning to Agree on Vision Attention for Visual Commonsense Reasoning | Feb 4, 2023 | Visual Commonsense Reasoning, Visual Reasoning | Unverified | 0 |
| Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning | Jan 30, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning | Jan 1, 2023 | Language Modeling, Language Modelling | Code Available | 1 |
| VASR: Visual Analogies of Situation Recognition | Dec 8, 2022 | Common Sense Reasoning, Triplet | Code Available | 0 |
| A survey on knowledge-enhanced multimodal learning | Nov 19, 2022 | Conditional Image Generation, Factual Visual Question Answering | Unverified | 0 |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Sep 20, 2022 | Multimodal Deep Learning, Multimodal Reasoning | Code Available | 2 |