| Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | Feb 11, 2023 | Image-text Retrieval, Knowledge Graphs | Code Available | 0 |
| Is Multimodal Vision Supervision Beneficial to Language? | Feb 10, 2023 | Image Retrieval, Natural Language Understanding | Code Available | 0 |
| BinaryVQA: A Versatile Test Set to Evaluate the Out-of-Distribution Generalization of VQA Models | Jan 28, 2023 | Out-of-Distribution Generalization, Question Answering | Code Available | 0 |
| Towards a Unified Model for Generating Answers and Explanations in Visual Question Answering | Jan 25, 2023 | Decoder, Explanation Generation | Unverified | 0 |
| HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images | Jan 23, 2023 | Attribute, Question Answering | Unverified | 0 |
| Towards Models that Can See and Read | Jan 18, 2023 | Decoder, Image Captioning | Unverified | 0 |
| Curriculum Script Distillation for Multilingual Visual Question Answering | Jan 17, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Adaptively Clustering Neighbor Elements for Image-Text Generation | Jan 5, 2023 | Clustering, Decoder | Code Available | 0 |
| From Images to Textual Prompts: Zero-Shot Visual Question Answering With Frozen Large Language Models | Jan 1, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language | Jan 1, 2023 | Question Answering, Self-Supervised Learning | Code Available | 0 |
| Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering | Jan 1, 2023 | Continual Learning, Language Modelling | Unverified | 0 |
| Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks | Jan 1, 2023 | Cross-Modal Retrieval, Image Captioning | Unverified | 0 |
| RMLVQA: A Margin Loss Approach for Visual Question Answering With Language Biases | Jan 1, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Toward Multi-Granularity Decision-Making: Explicit Visual Reasoning with Hierarchical Knowledge | Jan 1, 2023 | Decision Making, Question Answering | Code Available | 0 |
| PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3 | Jan 1, 2023 | Image Captioning, Question Answering | Unverified | 0 |
| When are Lemons Purple? The Concept Association Bias of Vision-Language Models | Dec 22, 2022 | Attribute, image-classification | Unverified | 0 |
| UnICLAM: Contrastive Representation Learning with Adversarial Masking for Unified and Interpretable Medical Vision Question Answering | Dec 21, 2022 | Data Augmentation, Decision Making | Unverified | 0 |
| From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models | Dec 21, 2022 | Question Answering, Visual Question Answering | Code Available | 0 |
| Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason? | Dec 20, 2022 | Question Answering, Representation Learning | Unverified | 0 |
| MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | Dec 19, 2022 | Chart Question Answering, Data Summarization | Unverified | 0 |
| SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering | Dec 16, 2022 | Optical Character Recognition, Optical Character Recognition (OCR) | Unverified | 0 |
| CLIPPO: Image-and-Language Understanding from Pixels Only | Dec 15, 2022 | Contrastive Learning, image-classification | Unverified | 0 |
| REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory | Dec 10, 2022 | Image Captioning, Language Modeling | Code Available | 0 |
| ParsVQA-Caps: A Benchmark for Visual Question Answering and Image Captioning in Persian | Dec 7, 2022 | Image Captioning, Question Answering | Unverified | 0 |
| Visual Question Answering From Another Perspective: CLEVR Mental Rotation Tests | Dec 3, 2022 | Question Answering, Visual Question Answering | Code Available | 0 |
| Compound Tokens: Channel Fusion for Vision-Language Representation Learning | Dec 2, 2022 | Decoder, Language Modeling | Unverified | 0 |
| Optimizing Explanations by Network Canonization and Hyperparameter Search | Nov 30, 2022 | Explainable Artificial Intelligence (XAI), image-classification | Unverified | 0 |
| PiggyBack: Pretrained Visual Question Answering Environment for Backing up Non-deep Learning Professionals | Nov 29, 2022 | Deep Learning, Question Answering | Unverified | 0 |
| Neuro-Symbolic Spatio-Temporal Reasoning | Nov 28, 2022 | AI Agent, Image Segmentation | Unverified | 0 |
| Look, Read and Ask: Learning to Ask Questions by Reading Text in Images | Nov 23, 2022 | Optical Character Recognition (OCR), Question Answering | Unverified | 0 |
| Cross-Modal Contrastive Learning for Robust Reasoning in VQA | Nov 21, 2022 | Contrastive Learning, Question Answering | Code Available | 0 |
| CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering | Nov 19, 2022 | Continual Learning, Question Answering | Unverified | 0 |
| Text-Aware Dual Routing Network for Visual Question Answering | Nov 17, 2022 | Optical Character Recognition, Optical Character Recognition (OCR) | Unverified | 0 |
| AlignVE: Visual Entailment Recognition Based on Alignment Relations | Nov 16, 2022 | Question Answering, Relation | Unverified | 0 |
| Visually Grounded VQA by Lattice-based Retrieval | Nov 15, 2022 | Information Retrieval, Question Answering | Code Available | 0 |
| MF2-MVQA: A Multi-stage Feature Fusion method for Medical Visual Question Answering | Nov 11, 2022 | Medical Visual Question Answering, Question Answering | Unverified | 0 |
| Towards Reasoning-Aware Explainable VQA | Nov 9, 2022 | Decoder, Explanation Generation | Unverified | 0 |
| ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation | Nov 9, 2022 | Contrastive Learning, Decoder | Unverified | 0 |
| Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems | Oct 26, 2022 | Question Answering, Visual Question Answering | Unverified | 0 |
| Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering | Oct 26, 2022 | Question Answering, Visual Question Answering | Code Available | 0 |
| What's Different between Visual Question Answering for Machine "Understanding" Versus for Accessibility? | Oct 26, 2022 | Benchmarking, Question Answering | Code Available | 0 |
| Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | Oct 24, 2022 | cross-modal alignment, Cross-Modal Retrieval | Unverified | 0 |
| RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | Oct 23, 2022 | Image Captioning, Image-text Retrieval | Unverified | 0 |
| Image Semantic Relation Generation | Oct 19, 2022 | Image Retrieval, Image Segmentation | Unverified | 0 |
| CPL: Counterfactual Prompt Learning for Vision and Language Models | Oct 19, 2022 | counterfactual, image-classification | Unverified | 0 |
| Aligning MAGMA by Few-Shot Learning and Finetuning | Oct 18, 2022 | Few-Shot Learning, Image Captioning | Unverified | 0 |
| Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering | Oct 18, 2022 | Passage Retrieval, Question Answering | Unverified | 0 |
| Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | Oct 17, 2022 | Image Captioning, Network Interpretation | Code Available | 0 |
| Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing | Oct 10, 2022 | Question Answering, Representation Learning | Unverified | 0 |
| MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning | Oct 9, 2022 | Image-text Retrieval, multimodal interaction | Unverified | 0 |