| Reason from Context with Self-supervised Learning | Nov 23, 2022 | Object, Object Recognition | Unverified | 0 |
| X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | Nov 22, 2022 | All, Cross-Modal Retrieval | Code Available | 2 |
| Unifying Vision-Language Representation Space with Single-tower Transformer | Nov 21, 2022 | Contrastive Learning, Object Localization | Unverified | 0 |
| A survey on knowledge-enhanced multimodal learning | Nov 19, 2022 | Conditional Image Generation, Factual Visual Question Answering | Unverified | 0 |
| Visual Programming: Compositional visual reasoning without training | Nov 18, 2022 | In-Context Learning, Question Answering | Code Available | 2 |
| lilGym: Natural Language Visual Reasoning with Reinforcement Learning | Nov 3, 2022 | reinforcement-learning, Reinforcement Learning | Unverified | 0 |
| MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Oct 11, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 |
| MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning | Oct 9, 2022 | Image-text Retrieval, multimodal interaction | Unverified | 0 |
| When and why vision-language models behave like bags-of-words, and what to do about it? | Oct 4, 2022 | Contrastive Learning, Retrieval | Code Available | 2 |
| Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning | Oct 4, 2022 | Image Captioning, Sentence | Code Available | 0 |
| Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach | Oct 3, 2022 | Referring Expression, Robot Manipulation | Code Available | 0 |
| A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering | Oct 1, 2022 | Medical Visual Question Answering, Question Answering | Code Available | 0 |
| Zero-shot visual reasoning through probabilistic analogical mapping | Sep 29, 2022 | Visual Reasoning | Unverified | 0 |
| Deep Neural Networks for Visual Reasoning | Sep 24, 2022 | Multimodal Reasoning, Visual Reasoning | Unverified | 0 |
| Belief Revision based Caption Re-ranker with Visual Semantic Information | Sep 16, 2022 | Caption Generation, Image Captioning | Code Available | 1 |
| Compositional Law Parsing with Latent Random Functions | Sep 15, 2022 | Position, Visual Reasoning | Unverified | 0 |
| VIPHY: Probing "Visible" Physical Commonsense Knowledge | Sep 15, 2022 | Visual Reasoning | Code Available | 1 |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | Sep 14, 2022 | Decoder, Few-Shot Image Classification | Unverified | 0 |
| Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Aug 29, 2022 | cross-modal alignment, Image-text Retrieval | Code Available | 1 |
| Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | Aug 22, 2022 | All, Cross-Modal Retrieval | Code Available | 0 |
| One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning | Jul 31, 2022 | All, Referring Expression | Unverified | 0 |
| WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models | Jul 25, 2022 | Common Sense Reasoning, General Knowledge | Code Available | 0 |
| 3D Concept Grounding on Neural Fields | Jul 13, 2022 | Instance Segmentation, Question Answering | Unverified | 0 |
| From Shallow to Deep: Compositional Reasoning over Graphs for Visual Question Answering | Jun 25, 2022 | Question Answering, Visual Question Answering | Unverified | 0 |
| VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives | Jun 22, 2022 | Feature Importance, Question Answering | Code Available | 0 |
| SAViR-T: Spatially Attentive Visual Reasoning with Transformers | Jun 18, 2022 | Inductive Bias, Visual Reasoning | Code Available | 0 |
| Interactive Visual Reasoning under Uncertainty | Jun 18, 2022 | Visual Reasoning | Unverified | 0 |
| MixGen: A New Multi-Modal Data Augmentation | Jun 16, 2022 | Data Augmentation, Image-text Retrieval | Code Available | 1 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Jun 15, 2022 | Described Object Detection, Image Captioning | Code Available | 1 |
| A Benchmark for Compositional Visual Reasoning | Jun 11, 2022 | Visual Reasoning | Code Available | 1 |
| GAMR: A Guided Attention Model for (visual) Reasoning | Jun 10, 2022 | model, Visual Reasoning | Code Available | 0 |
| VL-BEiT: Generative Vision-Language Pretraining | Jun 2, 2022 | image-classification, Image Classification | Unverified | 0 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | May 31, 2022 | Common Sense Reasoning, Graph Generation | Code Available | 1 |
| Few-shot Subgoal Planning with Language Models | May 28, 2022 | Language Modeling, Language Modelling | Unverified | 0 |
| CyCLIP: Cyclic Contrastive Language-Image Pretraining | May 28, 2022 | Representation Learning, Visual Reasoning | Code Available | 1 |
| Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions | May 27, 2022 | Benchmarking, Few-Shot Image Classification | Code Available | 1 |
| Guiding Visual Question Answering with Attention Priors | May 25, 2022 | Question Answering, Visual Grounding | Unverified | 0 |
| Continual learning on 3D point clouds with random compressed rehearsal | May 16, 2022 | Continual Learning, Visual Reasoning | Unverified | 0 |
| Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering | May 9, 2022 | multimodal interaction, Question Answering | Code Available | 0 |
| Introduction to Soar | May 8, 2022 | Chunking, Decision Making | Unverified | 0 |
| QLEVR: A Diagnostic Dataset for Quantificational Language and Elementary Visual Reasoning | May 6, 2022 | Diagnostic, Question Answering | Code Available | 0 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | May 4, 2022 | Action Classification, Decoder | Code Available | 1 |
| Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering | May 2, 2022 | Decoder, Image Captioning | Unverified | 0 |
| Visual Spatial Reasoning | Apr 30, 2022 | Spatial Reasoning | Code Available | 1 |
| RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning | Apr 24, 2022 | Human-Object Interaction Detection, Object | Code Available | 1 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | Apr 7, 2022 | Visual Reasoning | Code Available | 1 |
| CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Apr 5, 2022 | Explanation Generation, Question Answering | Code Available | 1 |
| Co-VQA : Answering by Interactive Sub Question Sequence | Apr 2, 2022 | Question Answering, Visual Question Answering | Unverified | 0 |
| Collaborative Transformers for Grounded Situation Recognition | Mar 30, 2022 | Grounded Situation Recognition, Image Classification | Code Available | 1 |
| REX: Reasoning-aware and Grounded Explanation | Mar 11, 2022 | Decision Making, Explanation Generation | Code Available | 1 |