| Interpreting and Controlling Vision Foundation Models via Text Explanations | Oct 16, 2023 | Model Editing; Visual Reasoning | Code Available | 1 |
| Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World | Oct 16, 2023 | Few-Shot Learning; Form | Code Available | 1 |
| Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | Sep 21, 2023 | Cross-Modal Retrieval; Image Captioning | Code Available | 0 |
| Visual Question Answering in the Medical Domain | Sep 20, 2023 | Contrastive Learning; Medical Visual Question Answering | Unverified | 0 |
| A Continual Learning Paradigm for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks | Sep 18, 2023 | Continual Learning; Visual Reasoning | Unverified | 0 |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Sep 14, 2023 | Hallucination; In-Context Learning | Code Available | 2 |
| Collecting Visually-Grounded Dialogue with A Game Of Sorts | Sep 10, 2023 | Coreference Resolution; Image Retrieval | Code Available | 0 |
| Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models | Sep 8, 2023 | Visual Reasoning | Code Available | 1 |
| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal Retrieval; Decision Making | Code Available | 1 |
| Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Aug 31, 2023 | Instruction Following; Visual Reasoning | Code Available | 1 |
| On the Potential of CLIP for Compositional Logical Reasoning | Aug 30, 2023 | Logical Reasoning; Visual Reasoning | Unverified | 0 |
| EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE | Aug 23, 2023 | Image-text matching; Image-text Retrieval | Unverified | 0 |
| An Examination of the Compositionality of Large Generative Vision-Language Models | Aug 21, 2023 | Visual Reasoning | Code Available | 1 |
| Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories | Aug 21, 2023 | Classification; Clustering | Unverified | 0 |
| Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models | Aug 18, 2023 | Image-text matching; Object Localization | Unverified | 0 |
| VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control | Aug 18, 2023 | Image Captioning; Text Generation | Code Available | 1 |
| Tree-of-Mixed-Thought: Combining Fast and Slow Thinking for Multi-hop Visual Reasoning | Aug 18, 2023 | Visual Reasoning | Unverified | 0 |
| Multimodal Analysis Of Google Bard And GPT-Vision: Experiments In Visual Reasoning | Aug 17, 2023 | Common Sense Reasoning; Optical Character Recognition | Unverified | 0 |
| Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks | Aug 17, 2023 | Question Answering; Text Generation | Code Available | 1 |
| Learning logic programs by discovering higher-order abstractions | Aug 16, 2023 | Inductive logic programming; Program Synthesis | Code Available | 0 |
| Learning Abstract Visual Reasoning via Task Decomposition: A Case Study in Raven Progressive Matrices | Aug 12, 2023 | Visual Reasoning | Code Available | 0 |
| 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | Aug 8, 2023 | 3D Question Answering (3D-QA); Dense Captioning | Code Available | 2 |
| TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models | Aug 7, 2023 | Hallucination; Object Hallucination | Code Available | 2 |
| Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks | Jul 31, 2023 | Image Retrieval; Object | Unverified | 0 |
| LOIS: Looking Out of Instance Semantics for Visual Question Answering | Jul 26, 2023 | Question Answering; Visual Question Answering | Unverified | 0 |
| Grounded Object Centric Learning | Jul 18, 2023 | Object; Object Discovery | Unverified | 0 |
| How is ChatGPT's behavior changing over time? | Jul 18, 2023 | Code Generation; Language Modelling | Code Available | 4 |
| Does Visual Pretraining Help End-to-End Reasoning? | Jul 17, 2023 | image-classification; Image Classification | Unverified | 0 |
| Abstracting Concept-Changing Rules for Solving Raven's Progressive Matrix Problems | Jul 15, 2023 | Answer Generation; Answer Selection | Unverified | 0 |
| Learning Differentiable Logic Programs for Abstract Visual Reasoning | Jul 3, 2023 | Program induction; Visual Reasoning | Code Available | 1 |
| Look, Remember and Reason: Grounded reasoning in videos with language models | Jun 30, 2023 | Object; object-detection | Unverified | 0 |
| Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages | Jun 29, 2023 | Image-text Retrieval; Machine Translation | Code Available | 0 |
| PhD Thesis: Exploring the role of (self-)attention in cognitive and computer vision architecture | Jun 26, 2023 | Visual Reasoning; Zero-shot Generalization | Unverified | 0 |
| A Survey on Multimodal Large Language Models | Jun 23, 2023 | Hallucination; In-Context Learning | Unverified | 0 |
| V-LoL: A Diagnostic Dataset for Visual Logical Learning | Jun 13, 2023 | Diagnostic; Logical Reasoning | Code Available | 0 |
| A Domain-Independent Agent Architecture for Adaptive Operation in Evolving Open Worlds | Jun 9, 2023 | Minecraft; Visual Reasoning | Unverified | 0 |
| Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding | Jun 9, 2023 | Few-Shot Learning; image-classification | Code Available | 0 |
| Systematic Visual Reasoning through Object-Centric Relational Abstraction | Jun 4, 2023 | Object; Systematic Generalization | Code Available | 0 |
| Revisiting the Role of Language Priors in Vision-Language Models | Jun 2, 2023 | Image-text matching; Image-text Retrieval | Code Available | 1 |
| CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | May 27, 2023 | Image Captioning; Image Retrieval | Code Available | 1 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | May 17, 2023 | Image Generation; Image to text | Code Available | 1 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | May 12, 2023 | Visual Reasoning | Code Available | 1 |
| Simple Token-Level Confidence Improves Caption Correctness | May 11, 2023 | Hallucination; Image Captioning | Unverified | 0 |
| Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | May 10, 2023 | Scene Understanding; Visual Reasoning | Unverified | 0 |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | May 5, 2023 | GPU; In-Context Learning | Code Available | 4 |
| Visual Transformation Telling | May 3, 2023 | Dense Video Captioning; Video Captioning | Code Available | 0 |
| Visual Reasoning: from State to Transformation | May 2, 2023 | Visual Question Answering (VQA); Visual Reasoning | Code Available | 1 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Apr 20, 2023 | Image Description; Language Modelling | Code Available | 7 |
| Visual Instruction Tuning | Apr 17, 2023 | 1 Image, 2*2 Stitching; 3D Question Answering (3D-QA) | Code Available | 6 |
| The role of object-centric representations, guided attention, and external memory on generalizing visual relations | Apr 14, 2023 | Relation; Visual Reasoning | Unverified | 0 |