| Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input | Jun 25, 2023 | Diversity, Image-text Retrieval | Unverified | 0 |
| Visual Question Answering in Remote Sensing with Cross-Attention and Multimodal Information Bottleneck | Jun 25, 2023 | Object Detection | Unverified | 0 |
| TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter | Jun 22, 2023 | Question Answering, Retrieval | Code Available | 0 |
| Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories | Jun 15, 2023 | Question Answering, Retrieval | Unverified | 0 |
| AVIS: Autonomous Visual Information Seeking with Large Language Model Agent | Jun 13, 2023 | Decision Making, Language Modeling | Unverified | 0 |
| Safeguarding Data in Multimodal AI: A Differentially Private Approach to CLIP Training | Jun 13, 2023 | Image Classification | Code Available | 0 |
| Visual Question Answering (VQA) on Images with Superimposed Text | Jun 13, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation | Jun 12, 2023 | Image Captioning, Machine Translation | Unverified | 0 |
| Knowledge Detection by Relevant Question and Image Attributes in Visual Question Answering | Jun 8, 2023 | Question Answering, Retrieval | Unverified | 0 |
| Diversifying Joint Vision-Language Tokenization Learning | Jun 6, 2023 | Question Answering, Representation Learning | Unverified | 0 |
| Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes | Jun 4, 2023 | Common Sense Reasoning, Question Answering | Unverified | 0 |
| LiT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing | Jun 1, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data | Jun 1, 2023 | Anomaly Detection, Image Generation | Unverified | 0 |
| Overcoming Language Bias in Remote Sensing Visual Question Answering via Adversarial Training | Jun 1, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA | May 31, 2023 | counterfactual, Counterfactual Inference | Unverified | 0 |
| Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models | May 31, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge | May 30, 2023 | Answer Selection, Question Answering | Unverified | 0 |
| HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language | May 28, 2023 | Machine Translation, Multimodal Machine Translation | Code Available | 0 |
| Modularized Zero-shot VQA with Pre-trained Models | May 27, 2023 | Object Detection | Code Available | 0 |
| Zero-shot Visual Question Answering with Language Model Feedback | May 26, 2023 | Language Modeling, Language Modelling | Code Available | 0 |
| Mindstorms in Natural Language-Based Societies of Mind | May 26, 2023 | 3D Generation, Image Captioning | Unverified | 0 |
| GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions | May 24, 2023 | Object, Question Answering | Unverified | 0 |
| Measuring Faithful and Plausible Visual Grounding in VQA | May 24, 2023 | Question Answering, Visual Grounding | Code Available | 0 |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | May 24, 2023 | Image Captioning, Language Modelling | Unverified | 0 |
| Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering | May 24, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |