| Chain of Thought Prompt Tuning in Vision Language Models | Apr 16, 2023 | Domain Generalization, image-classification | Unverified | 0 |
| PDFVQA: A New Dataset for Real-World VQA on PDF Documents | Apr 13, 2023 | document understanding, Key Information Extraction | Unverified | 0 |
| CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes | Apr 12, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| Advancing Medical Imaging with Language Models: A Journey from N-grams to ChatGPT | Apr 11, 2023 | Diagnostic, Image Captioning | Unverified | 0 |
| Boosting Cross-task Transferability of Adversarial Patches with Visual Relations | Apr 11, 2023 | Image Captioning, Object Recognition | Unverified | 0 |
| CAVL: Learning Contrastive and Adaptive Representations of Vision and Language | Apr 10, 2023 | Image Retrieval, Phrase Grounding | Unverified | 0 |
| Multilingual Augmentation for Robust Visual Question Answering in Remote Sensing Images | Apr 7, 2023 | Contrastive Learning, Question Answering | Unverified | 0 |
| Improving Visual Question Answering Models through Robustness Analysis and In-Context Learning with a Chain of Basic Questions | Apr 6, 2023 | In-Context Learning, Question Answering | Unverified | 0 |
| I2I: Initializing Adapters with Improvised Knowledge | Apr 4, 2023 | Continual Learning, Question Answering | Code Available | 1 |
| Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA | Apr 4, 2023 | Answer Generation, Language Modelling | Unverified | 0 |
| SC-ML: Self-supervised Counterfactual Metric Learning for Debiased Visual Question Answering | Apr 4, 2023 | counterfactual, Metric Learning | Unverified | 0 |
| Q2ATransformer: Improving Medical VQA via an Answer Querying Decoder | Apr 4, 2023 | Classification, Decoder | Unverified | 0 |
| Instance-Level Trojan Attacks on Visual Question Answering via Adversarial Learning in Neuron Activation Space | Apr 2, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision | Mar 30, 2023 | Decoder, Multi-Task Learning | Unverified | 0 |
| MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | Mar 29, 2023 | Cross-Modal Retrieval, Decoder | Code Available | 0 |
| Curriculum Learning for Compositional Visual Reasoning | Mar 27, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Integrating Image Features with Convolutional Sequence-to-sequence Network for Multilingual Visual Question Answering | Mar 22, 2023 | Question Answering, Visual Question Answering | Code Available | 0 |
| TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering | Mar 21, 2023 | 4k, Image Generation | Code Available | 1 |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | Mar 20, 2023 | Multimodal Reasoning, Visual Question Answering | Code Available | 2 |
| 3D Concept Learning and Reasoning from Multi-View Images | Mar 20, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Location-Free Scene Graph Generation | Mar 20, 2023 | Graph Generation, Image Retrieval | Code Available | 1 |
| FVQA 2.0: Introducing Adversarial Samples into Fact-based Visual Question Answering | Mar 19, 2023 | Common Sense Reasoning, Information Retrieval | Unverified | 0 |
| Logical Implications for Visual Question Answering Consistency | Mar 16, 2023 | Language Modeling, Language Modelling | Code Available | 0 |
| GPT-4 Technical Report | Mar 15, 2023 | answerability prediction, Arithmetic Reasoning | Code Available | 6 |
| Polar-VQA: Visual Question Answering on Remote Sensed Ice sheet Imagery from Polar Region | Mar 13, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Vision-Language Models as Success Detectors | Mar 13, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | Mar 13, 2023 | Common Sense Reasoning, Explanation Generation | Unverified | 0 |
| Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models | Mar 10, 2023 | Language Modeling, Language Modelling | Code Available | 1 |
| Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning | Mar 10, 2023 | Few-Shot Image Classification, image-classification | Unverified | 0 |
| Toward Unsupervised Realistic Visual Question Answering | Mar 9, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Interpretable Visual Question Answering Referring to Outside Knowledge | Mar 8, 2023 | Diversity, Image Captioning | Unverified | 0 |
| Graph Neural Networks in Vision-Language Image Understanding: A Survey | Mar 7, 2023 | Image Captioning, Image Retrieval | Unverified | 0 |
| PaLM-E: An Embodied Multimodal Language Model | Mar 6, 2023 | Language Modeling, Language Modelling | Code Available | 2 |
| Knowledge-Based Counterfactual Queries for Visual Question Answering | Mar 5, 2023 | counterfactual, Decision Making | Unverified | 0 |
| VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning | Mar 5, 2023 | Answer Generation, Entity Alignment | Code Available | 0 |
| Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering | Mar 3, 2023 | Language Modelling, Large Language Model | Code Available | 2 |
| ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax | Mar 2, 2023 | Descriptive, Image Captioning | Code Available | 1 |
| BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs | Mar 2, 2023 | Articles, Medical Visual Question Answering | Code Available | 1 |
| MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering | Mar 2, 2023 | Mixture-of-Experts, Question Answering | Code Available | 1 |
| RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training | Mar 1, 2023 | Question Answering, Retrieval | Code Available | 1 |
| VQA with Cascade of Self- and Co-Attention Blocks | Feb 28, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Language Is Not All You Need: Aligning Perception with Language Models | Feb 27, 2023 | All, Image Captioning | Unverified | 0 |
| Medical visual question answering using joint self-supervised learning | Feb 25, 2023 | Decoder, Diversity | Unverified | 0 |
| Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Feb 23, 2023 | Open-Domain Question Answering, Question Answering | Code Available | 1 |
| EVJVQA Challenge: Multilingual Visual Question Answering | Feb 23, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| VinVL+L: Enriching Visual Representation with Location Context in VQA | Feb 22, 2023 | Question Answering, TAG | Code Available | 0 |
| Reusable Slotwise Mechanisms | Feb 21, 2023 | Future prediction, Object | Unverified | 0 |
| Few-shot Multimodal Multitask Multilingual Learning | Feb 19, 2023 | Few-Shot Learning, In-Context Learning | Unverified | 0 |
| Interpretable Medical Image Visual Question Answering via Multi-Modal Relationship Graph Learning | Feb 19, 2023 | Graph Learning, Medical Visual Question Answering | Unverified | 0 |
| Bridge Damage Cause Estimation Using Multiple Images Based on Visual Question Answering | Feb 18, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |