| SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection | Mar 5, 2024 | Concept Alignment, Explanation Generation | Unverified | 0 |
| What Is Missing in Multilingual Visual Reasoning and How to Fix It | Mar 3, 2024 | Image Captioning, Visual Reasoning | Code Available | 0 |
| Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks | Mar 1, 2024 | Visual Reasoning | Code Available | 1 |
| Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning | Mar 1, 2024 | Disentanglement, Informativeness | Code Available | 0 |
| VISREAS: Complex Visual Reasoning with Unanswerable Questions | Feb 23, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Stop Reasoning! When Multimodal LLM with Chain-of-Thought Reasoning Meets Adversarial Image | Feb 22, 2024 | Adversarial Robustness, Multimodal Reasoning | Code Available | 1 |
| PALO: A Polyglot Large Multimodal Model for 5B People | Feb 22, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| Visual Reasoning in Object-Centric Deep Neural Networks: A Comparative Cognition Approach | Feb 20, 2024 | Object, Relational Reasoning | Code Available | 0 |
| Visual In-Context Learning for Large Vision-Language Models | Feb 18, 2024 | In-Context Learning, Position | Unverified | 0 |
| ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling | Feb 9, 2024 | Hallucination, Natural Language Understanding | Code Available | 0 |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | Feb 6, 2024 | Visual Reasoning | Code Available | 3 |
| Neural networks for abstraction and reasoning: Towards broad generalization in machines | Feb 5, 2024 | ARC, Visual Reasoning | Code Available | 3 |
| Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | Jan 29, 2024 | Benchmarking, Image Comprehension | Unverified | 0 |
| ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models | Jan 24, 2024 | Visual Reasoning | Code Available | 1 |
| Prompting Large Vision-Language Models for Compositional Reasoning | Jan 20, 2024 | Retrieval, Visual Reasoning | Code Available | 0 |
| Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually | Jan 19, 2024 | counterfactual, Counterfactual Explanation | Code Available | 0 |
| Towards Generative Abstract Reasoning: Completing Raven's Progressive Matrix via Rule Abstraction and Selection | Jan 18, 2024 | Answer Generation, Attribute | Code Available | 0 |
| Language-Conditioned Robotic Manipulation with Fast and Slow Thinking | Jan 8, 2024 | Decision Making, Intent Recognition | Unverified | 0 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | Jan 5, 2024 | Image Comprehension, Image to text | Code Available | 0 |
| Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers | Jan 3, 2024 | Question Answering, Visual Grounding | Unverified | 0 |
| Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts | Jan 1, 2024 | Image Generation, Instruction Following | Unverified | 0 |
| ChartBench: A Benchmark for Complex Visual Reasoning in Charts | Dec 26, 2023 | Visual Reasoning | Unverified | 0 |
| VCoder: Versatile Vision Encoders for Multimodal Large Language Models | Dec 21, 2023 | Image Captioning, Image Generation | Code Available | 2 |
| A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise | Dec 19, 2023 | MME, Visual Reasoning | Code Available | 0 |
| One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems | Dec 15, 2023 | Odd One Out, Transfer Learning | Code Available | 0 |
| GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives | Dec 7, 2023 | Graph Generation, Language Modelling | Code Available | 0 |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Dec 5, 2023 | Benchmarking, Visual Question Answering | Code Available | 1 |
| X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning | Nov 30, 2023 | Visual Reasoning | Code Available | 1 |
| Leveraging VLM-Based Pipelines to Annotate 3D Objects | Nov 29, 2023 | In-Context Learning, Language Modeling | Unverified | 0 |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | Nov 27, 2023 | Language Modelling, Large Language Model | Code Available | 1 |
| MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | Nov 27, 2023 | Complex Query Answering, Logical Reasoning | Code Available | 5 |
| How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Nov 27, 2023 | Adversarial Robustness, Visual Question Answering (VQA) | Code Available | 1 |
| From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation | Nov 21, 2023 | Explanation Generation, Visual Question Answering (VQA) | Unverified | 0 |
| SelfEval: Leveraging the discriminative nature of generative models for evaluation | Nov 17, 2023 | Attribute, Visual Reasoning | Unverified | 0 |
| The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | Nov 15, 2023 | Visual Reasoning | Unverified | 0 |
| Solving ARC visual analogies with neural embeddings and vector arithmetic: A generalized method | Nov 14, 2023 | ARC, Dimensionality Reduction | Code Available | 0 |
| Adaptive recurrent vision performs zero-shot computation scaling to unseen difficulty levels | Nov 12, 2023 | Pathfinder, Visual Reasoning | Unverified | 0 |
| Visual Commonsense based Heterogeneous Graph Contrastive Learning | Nov 11, 2023 | Contrastive Learning, Question Answering | Unverified | 0 |
| Towards A Unified Neural Architecture for Visual Recognition and Reasoning | Nov 10, 2023 | Object, object-detection | Unverified | 0 |
| GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Nov 8, 2023 | Question Answering, Referring Expression | Code Available | 1 |
| NeuSyRE: Neuro-Symbolic Visual Understanding and Reasoning Framework based on Scene Graph Enrichment | Nov 5, 2023 | Caption Generation, Common Sense Reasoning | Code Available | 1 |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Nov 2, 2023 | MME, Visual Reasoning | Code Available | 1 |
| Weakly Supervised Semantic Parsing with Execution-based Spurious Program Filtering | Nov 2, 2023 | Semantic Parsing, Visual Reasoning | Code Available | 1 |
| Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection | Oct 29, 2023 | Anomaly Detection, Image Captioning | Code Available | 1 |
| OC-NMN: Object-centric Compositional Neural Module Network for Generative Visual Analogical Reasoning | Oct 28, 2023 | Data Augmentation, Out-of-Distribution Generalization | Unverified | 0 |
| Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting | Oct 28, 2023 | Relation, Visual Reasoning | Unverified | 0 |
| ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese | Oct 27, 2023 | Information Retrieval, Natural Language Queries | Code Available | 0 |
| Multimodal Representations for Teacher-Guided Compositional Visual Reasoning | Oct 24, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| What's Left? Concept Grounding with Logic-Enhanced Foundation Models | Oct 24, 2023 | Visual Question Answering (VQA) Split A, Visual Question Answering (VQA) Split B | Code Available | 1 |
| Superpixel Semantics Representation and Pre-training for Vision-Language Task | Oct 20, 2023 | Self-Supervised Learning, Superpixels | Unverified | 0 |