| Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding | Mar 21, 2023 | Knowledge Probing, Language Modelling | Code Available | 1 |
| Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning | Mar 18, 2023 | Decision Making, Visual Reasoning | Code Available | 1 |
| Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications | Feb 1, 2023 | Question Answering, Representation Learning | Code Available | 1 |
| UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers | Jan 31, 2023 | Image Captioning, Image Classification | Code Available | 1 |
| See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning | Jan 12, 2023 | Few-Shot Learning, Image Captioning | Code Available | 1 |
| Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training | Jan 1, 2023 | 3D dense captioning, 3D visual grounding | Code Available | 1 |
| Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment | Dec 20, 2022 | Relation, Visual Reasoning | Code Available | 1 |
| MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Dec 19, 2022 | Form, Question Answering | Code Available | 1 |
| Position-guided Text Prompt for Vision-Language Pre-training | Dec 19, 2022 | Cross-Modal Retrieval, Image Captioning | Code Available | 1 |
| Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Dec 15, 2022 | Benchmarking, Image Captioning | Code Available | 1 |
| Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning | Dec 1, 2022 | Domain Generalization, Question Answering | Code Available | 1 |
| Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation | Nov 28, 2022 | 3D Reconstruction, Decoder | Code Available | 1 |
| MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Oct 11, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 |
| Belief Revision based Caption Re-ranker with Visual Semantic Information | Sep 16, 2022 | Caption Generation, Image Captioning | Code Available | 1 |
| VIPHY: Probing "Visible" Physical Commonsense Knowledge | Sep 15, 2022 | Visual Reasoning | Code Available | 1 |
| Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Aug 29, 2022 | cross-modal alignment, Image-text Retrieval | Code Available | 1 |
| MixGen: A New Multi-Modal Data Augmentation | Jun 16, 2022 | Data Augmentation, Image-text Retrieval | Code Available | 1 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Jun 15, 2022 | Described Object Detection, Image Captioning | Code Available | 1 |
| A Benchmark for Compositional Visual Reasoning | Jun 11, 2022 | Visual Reasoning | Code Available | 1 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | May 31, 2022 | Common Sense Reasoning, Graph Generation | Code Available | 1 |
| CyCLIP: Cyclic Contrastive Language-Image Pretraining | May 28, 2022 | Representation Learning, Visual Reasoning | Code Available | 1 |
| Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions | May 27, 2022 | Benchmarking, Few-Shot Image Classification | Code Available | 1 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | May 4, 2022 | Action Classification, Decoder | Code Available | 1 |
| Visual Spatial Reasoning | Apr 30, 2022 | Spatial Reasoning | Code Available | 1 |
| RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning | Apr 24, 2022 | Human-Object Interaction Detection, Object | Code Available | 1 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | Apr 7, 2022 | Visual Reasoning | Code Available | 1 |
| CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Apr 5, 2022 | Explanation Generation, Question Answering | Code Available | 1 |
| Collaborative Transformers for Grounded Situation Recognition | Mar 30, 2022 | Grounded Situation Recognition, Image Classification | Code Available | 1 |
| REX: Reasoning-aware and Grounded Explanation | Mar 11, 2022 | Decision Making, Explanation Generation | Code Available | 1 |
| Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Dec 22, 2021 | Common Sense Reasoning, Question Answering | Code Available | 1 |
| Distilled Dual-Encoder Model for Vision-Language Understanding | Dec 16, 2021 | Image to text, model | Code Available | 1 |
| FLAVA: A Foundational Language And Vision Alignment Model | Dec 8, 2021 | Image Retrieval, Image-to-Text Retrieval | Code Available | 1 |
| Grounded Situation Recognition with Transformers | Nov 19, 2021 | Decoder, Grounded Situation Recognition | Code Available | 1 |
| Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | Nov 16, 2021 | Cross-Modal Retrieval, Image Captioning | Code Available | 1 |
| VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | Nov 3, 2021 | Image Retrieval, Image-text Retrieval | Code Available | 1 |
| An Empirical Study of Training End-to-End Vision-and-Language Transformers | Nov 3, 2021 | Cross-Modal Retrieval, Decoder | Code Available | 1 |
| ProTo: Program-Guided Transformer for Program-Guided Tasks | Oct 2, 2021 | Decision Making, Learning to Execute | Code Available | 1 |
| Visually Grounded Reasoning across Languages and Cultures | Sep 28, 2021 | Cross-Lingual Transfer, Visual Reasoning | Code Available | 1 |
| ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration | Aug 16, 2021 | Visual Reasoning | Code Available | 1 |
| Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models | Aug 9, 2021 | Composed Image Retrieval (CoIR), Image Retrieval | Code Available | 1 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Jul 16, 2021 | Cross-Modal Retrieval, Grounded language learning | Code Available | 1 |
| Understanding and Evaluating Racial Biases in Image Captioning | Jun 16, 2021 | Benchmarking, Image Captioning | Code Available | 1 |
| Referring Transformer: A One-step Approach to Multi-task Visual Grounding | Jun 6, 2021 | Decoder, Referring Expression | Code Available | 1 |
| Learning Relation Alignment for Calibrated Cross-modal Retrieval | May 28, 2021 | Cross-Modal Retrieval, Image-text Retrieval | Code Available | 1 |
| Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning | May 10, 2021 | Arithmetic Reasoning, Geometry Problem Solving | Code Available | 1 |
| Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | Apr 7, 2021 | Representation Learning, Retrieval | Code Available | 1 |
| ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | Feb 5, 2021 | Cross-Modal Retrieval, Image Retrieval | Code Available | 1 |
| DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue | Jan 1, 2021 | Diagnostic, Object Tracking | Code Available | 1 |
| Transformation Driven Visual Reasoning | Nov 26, 2020 | Attribute, Triplet | Code Available | 1 |
| Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs | Oct 15, 2020 | Language Modeling, Language Modelling | Code Available | 1 |