| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Sep 23, 2024 | Image Comprehension, Referring Expression | Code Available | 1 |
| Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs | Jul 26, 2024 | Action Generation, Large Language Model | Code Available | 1 |
| KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models | Jul 25, 2024 | Visual Analogies, Visual Reasoning | Code Available | 1 |
| LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models | Jul 23, 2024 | Multimodal Reasoning, Prompt Engineering | Code Available | 1 |
| From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis | Jun 28, 2024 | Visual Question Answering (VQA), Visual Reasoning | Code Available | 1 |
| Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding | Jun 27, 2024 | Visual Reasoning | Code Available | 1 |
| Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos | Jun 27, 2024 | Temporal Information Extraction, Visual Reasoning | Code Available | 1 |
| RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding | Jun 18, 2024 | Attribute, Instruction Following | Code Available | 1 |
| Slot State Space Models | Jun 18, 2024 | Mamba, State Space Models | Code Available | 1 |
| ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension | Jun 17, 2024 | Decoder, Visual Reasoning | Code Available | 1 |
| Neural Concept Binder | Jun 14, 2024 | Descriptive, Retrieval | Code Available | 1 |
| INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Jun 13, 2024 | Multiple-choice, Visual Reasoning | Code Available | 1 |
| Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs | May 24, 2024 | Hallucination, Response Generation | Code Available | 1 |
| MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems | Apr 15, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | Mar 27, 2024 | Representation Learning, Visual Question Answering | Code Available | 1 |
| HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | Mar 19, 2024 | Reinforcement Learning (RL), Visual Grounding | Code Available | 1 |
| Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models | Mar 19, 2024 | Image Classification | Code Available | 1 |
| How Far Are We from Intelligent Visual Deductive Reasoning? | Mar 7, 2024 | In-Context Learning, Visual Reasoning | Code Available | 1 |
| Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks | Mar 1, 2024 | Visual Reasoning | Code Available | 1 |
| Stop Reasoning! When Multimodal LLM with Chain-of-Thought Reasoning Meets Adversarial Image | Feb 22, 2024 | Adversarial Robustness, Multimodal Reasoning | Code Available | 1 |
| ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models | Jan 24, 2024 | Visual Reasoning | Code Available | 1 |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Dec 5, 2023 | Benchmarking, Visual Question Answering | Code Available | 1 |
| X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning | Nov 30, 2023 | Visual Reasoning | Code Available | 1 |
| How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Nov 27, 2023 | Adversarial Robustness, Visual Question Answering (VQA) | Code Available | 1 |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | Nov 27, 2023 | Language Modelling, Large Language Model | Code Available | 1 |
| GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Nov 8, 2023 | Question Answering, Referring Expression | Code Available | 1 |
| NeuSyRE: Neuro-Symbolic Visual Understanding and Reasoning Framework based on Scene Graph Enrichment | Nov 5, 2023 | Caption Generation, Common Sense Reasoning | Code Available | 1 |
| Weakly Supervised Semantic Parsing with Execution-based Spurious Program Filtering | Nov 2, 2023 | Semantic Parsing, Visual Reasoning | Code Available | 1 |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Nov 2, 2023 | MME, Visual Reasoning | Code Available | 1 |
| Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection | Oct 29, 2023 | Anomaly Detection, Image Captioning | Code Available | 1 |
| What's Left? Concept Grounding with Logic-Enhanced Foundation Models | Oct 24, 2023 | Visual Question Answering (VQA) Split A, Visual Question Answering (VQA) Split B | Code Available | 1 |
| Interpreting and Controlling Vision Foundation Models via Text Explanations | Oct 16, 2023 | Model Editing, Visual Reasoning | Code Available | 1 |
| Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World | Oct 16, 2023 | Few-Shot Learning, Form | Code Available | 1 |
| Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models | Sep 8, 2023 | Visual Reasoning | Code Available | 1 |
| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal Retrieval, Decision Making | Code Available | 1 |
| Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Aug 31, 2023 | Instruction Following, Visual Reasoning | Code Available | 1 |
| An Examination of the Compositionality of Large Generative Vision-Language Models | Aug 21, 2023 | Visual Reasoning | Code Available | 1 |
| VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control | Aug 18, 2023 | Image Captioning, Text Generation | Code Available | 1 |
| Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks | Aug 17, 2023 | Question Answering, Text Generation | Code Available | 1 |
| Learning Differentiable Logic Programs for Abstract Visual Reasoning | Jul 3, 2023 | Program induction, Visual Reasoning | Code Available | 1 |
| Revisiting the Role of Language Priors in Vision-Language Models | Jun 2, 2023 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | May 27, 2023 | Image Captioning, Image Retrieval | Code Available | 1 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | May 17, 2023 | Image Generation, Image to text | Code Available | 1 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | May 12, 2023 | Visual Reasoning | Code Available | 1 |
| Visual Reasoning: from State to Transformation | May 2, 2023 | Visual Question Answering (VQA), Visual Reasoning | Code Available | 1 |
| Going Beyond Nouns With Vision & Language Models Using Synthetic Data | Mar 30, 2023 | Sentence, Visual Reasoning | Code Available | 1 |
| IRFL: Image Recognition of Figurative Language | Mar 27, 2023 | Classification, Visual Reasoning | Code Available | 1 |
| Equivariant Similarity for Vision-Language Foundation Models | Mar 25, 2023 | Image-text Retrieval, Retrieval | Code Available | 1 |
| NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations | Mar 23, 2023 | Question Answering, Referring Expression | Code Available | 1 |
| Abstract Visual Reasoning: An Algebraic Approach for Solving Raven's Progressive Matrices | Mar 21, 2023 | Visual Reasoning | Code Available | 1 |