| ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts | Dec 1, 2023 | Visual Commonsense Reasoning, Visual Prompting | Code Available | 0 |
| Improving Vision-and-Language Reasoning via Spatial Relations Modeling | Nov 9, 2023 | Position regression, Relation | Unverified | 0 |
| ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models | Oct 9, 2023 | Image Captioning, Visual Commonsense Reasoning | Unverified | 0 |
| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal Retrieval, Decision Making | Code Available | 1 |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | Jul 7, 2023 | Attribute, Common Sense Reasoning | Code Available | 2 |
| Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning | May 26, 2023 | Object Recognition, Visual Commonsense Reasoning | Unverified | 0 |
| GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions | May 24, 2023 | Object, Question Answering | Unverified | 0 |
| CAVL: Learning Contrastive and Adaptive Representations of Vision and Language | Apr 10, 2023 | Image Retrieval, Phrase Grounding | Unverified | 0 |
| Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | Mar 13, 2023 | Common Sense Reasoning, Explanation Generation | Unverified | 0 |
| Learning to Agree on Vision Attention for Visual Commonsense Reasoning | Feb 4, 2023 | Visual Commonsense Reasoning, Visual Reasoning | Unverified | 0 |