| Boosting Cross-task Transferability of Adversarial Patches with Visual Relations | Apr 11, 2023 | Image Captioning, Object Recognition | Unverified | 0 |
| CAVL: Learning Contrastive and Adaptive Representations of Vision and Language | Apr 10, 2023 | Image Retrieval, Phrase Grounding | Unverified | 0 |
| Explainable AI And Visual Reasoning: Insights From Radiology | Apr 6, 2023 | Diagnostic, Explainable Artificial Intelligence (XAI) | Unverified | 0 |
| Navigating to Objects Specified by Images | Apr 3, 2023 | Navigate, Visual Reasoning | Unverified | 0 |
| Going Beyond Nouns With Vision & Language Models Using Synthetic Data | Mar 30, 2023 | Sentence, Visual Reasoning | Code Available | 1 |
| Your Diffusion Model is Secretly a Zero-Shot Classifier | Mar 28, 2023 | Domain Generalization, Fine-Grained Image Classification | Code Available | 2 |
| Curriculum Learning for Compositional Visual Reasoning | Mar 27, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| IRFL: Image Recognition of Figurative Language | Mar 27, 2023 | Classification, Visual Reasoning | Code Available | 1 |
| Equivariant Similarity for Vision-Language Foundation Models | Mar 25, 2023 | Image-text Retrieval, Retrieval | Code Available | 1 |
| NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations | Mar 23, 2023 | Question Answering, Referring Expression | Code Available | 1 |
| Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding | Mar 21, 2023 | Knowledge Probing, Language Modelling | Code Available | 1 |
| Abstract Visual Reasoning: An Algebraic Approach for Solving Raven's Progressive Matrices | Mar 21, 2023 | Visual Reasoning | Code Available | 1 |
| 3D Concept Learning and Reasoning from Multi-View Images | Mar 20, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning | Mar 18, 2023 | Decision Making, Visual Reasoning | Code Available | 1 |
| ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | Mar 12, 2023 | Image Captioning, Question Answering | Code Available | 2 |
| Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning | Mar 10, 2023 | Few-Shot Image Classification, image-classification | Unverified | 0 |
| Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Mar 10, 2023 | Image Generation, multimodal generation | Code Available | 0 |
| Abstract Visual Reasoning Enabled by Language | Mar 7, 2023 | ARC, Visual Reasoning | Unverified | 0 |
| Visual Analytics of Neuron Vulnerability to Adversarial Attacks on Convolutional Neural Networks | Mar 6, 2023 | Autonomous Driving, Medical Diagnosis | Unverified | 0 |
| Learning to reason over visual objects | Mar 3, 2023 | Inductive Bias, Visual Reasoning | Code Available | 0 |
| Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos | Mar 2, 2023 | Representation Learning, Sentence | Unverified | 0 |
| Explicit3D: Graph Network with Spatial Inference for Single Image 3D Object Detection | Feb 13, 2023 | 3D Object Detection, Graph Generation | Unverified | 0 |
| Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | Feb 11, 2023 | Image-text Retrieval, Knowledge Graphs | Code Available | 0 |
| Learning to Agree on Vision Attention for Visual Commonsense Reasoning | Feb 4, 2023 | Visual Commonsense Reasoning, Visual Reasoning | Unverified | 0 |
| Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications | Feb 1, 2023 | Question Answering, Representation Learning | Code Available | 1 |
| UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers | Jan 31, 2023 | Image Captioning, Image Classification | Code Available | 1 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question Answering, Image Captioning | Code Available | 4 |
| Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | Jan 12, 2023 | Cross-Modal Retrieval, Open-Ended Question Answering | Code Available | 0 |
| See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning | Jan 12, 2023 | Few-Shot Learning, Image Captioning | Code Available | 1 |
| A Divide-Align-Conquer Strategy for Program Synthesis | Jan 8, 2023 | ARC, Inductive logic programming | Unverified | 0 |
| Open Set Video HOI detection from Action-Centric Chain-of-Look Prompting | Jan 1, 2023 | Human-Object Interaction Detection, Language Modelling | Unverified | 0 |
| Toward Multi-Granularity Decision-Making: Explicit Visual Reasoning with Hierarchical Knowledge | Jan 1, 2023 | Decision Making, Question Answering | Code Available | 0 |
| Graph Representation for Order-Aware Visual Transformation | Jan 1, 2023 | Visual Reasoning | Unverified | 0 |
| ViLEM: Visual-Language Error Modeling for Image-Text Retrieval | Jan 1, 2023 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| Unicode Analogies: An Anti-Objectivist Visual Reasoning Challenge | Jan 1, 2023 | Navigate, Visual Reasoning | Code Available | 0 |
| Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks | Jan 1, 2023 | Cross-Modal Retrieval, Image Captioning | Unverified | 0 |
| Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training | Jan 1, 2023 | 3D dense captioning, 3D visual grounding | Code Available | 1 |
| EuclidNet: Deep Visual Reasoning for Constructible Problems in Geometry | Dec 27, 2022 | Automated Theorem Proving, Visual Reasoning | Unverified | 0 |
| VQA and Visual Reasoning: An Overview of Recent Datasets, Methods and Challenges | Dec 26, 2022 | Representation Learning, Visual Question Answering (VQA) | Unverified | 0 |
| Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment | Dec 20, 2022 | Relation, Visual Reasoning | Code Available | 1 |
| Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason? | Dec 20, 2022 | Question Answering, Representation Learning | Unverified | 0 |
| MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Dec 19, 2022 | Form, Question Answering | Code Available | 1 |
| Position-guided Text Prompt for Vision-Language Pre-training | Dec 19, 2022 | Cross-Modal Retrieval, Image Captioning | Code Available | 1 |
| Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Dec 15, 2022 | Benchmarking, Image Captioning | Code Available | 1 |
| VASR: Visual Analogies of Situation Recognition | Dec 8, 2022 | Common Sense Reasoning, Triplet | Code Available | 0 |
| Does Structural Attention Improve Compositional Representations in Vision-Language Models? | Dec 3, 2022 | Visual Reasoning | Unverified | 0 |
| Visual Question Answering From Another Perspective: CLEVR Mental Rotation Tests | Dec 3, 2022 | Question Answering, Visual Question Answering | Code Available | 0 |
| Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning | Dec 1, 2022 | Domain Generalization, Question Answering | Code Available | 1 |
| Abstract Visual Reasoning with Tangram Shapes | Nov 29, 2022 | Visual Reasoning | Unverified | 0 |
| Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation | Nov 28, 2022 | 3D Reconstruction, Decoder | Code Available | 1 |