| VDC: Versatile Data Cleanser based on Visual-Linguistic Inconsistency by Multimodal Large Language Models | Sep 28, 2023 | Backdoor Attack, cross-modal alignment | Code Available | 1 |
| Tackling VQA with Pretrained Foundation Models without Further Training | Sep 27, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Sentence Attention Blocks for Answer Grounding | Sep 20, 2023 | Question Answering, Sentence | Unverified | 0 |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | Sep 20, 2023 | multimodal generation, Visual Question Answering | Code Available | 2 |
| Visual Question Answering in the Medical Domain | Sep 20, 2023 | Contrastive Learning, Medical Visual Question Answering | Unverified | 0 |
| KOSMOS-2.5: A Multimodal Literate Model | Sep 20, 2023 | document understanding, model | Unverified | 0 |
| An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models | Sep 18, 2023 | Visual Question Answering | Code Available | 6 |
| Syntax Tree Constrained Graph Network for Visual Question Answering | Sep 17, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| D3: Data Diversity Design for Systematic Generalization in Visual Question Answering | Sep 15, 2023 | Diversity, Question Answering | Code Available | 0 |
| TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild | Sep 14, 2023 | Decoder, Instruction Following | Code Available | 1 |
| Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning | Sep 12, 2023 | Autonomous Vehicles, Question Answering | Unverified | 0 |
| Interpretable Visual Question Answering via Reasoning Supervision | Sep 7, 2023 | Common Sense Reasoning, Question Answering | Unverified | 0 |
| Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models | Sep 7, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal Retrieval, Decision Making | Code Available | 1 |
| Physically Grounded Vision-Language Models for Robotic Manipulation | Sep 5, 2023 | Image Captioning, Language Modelling | Unverified | 0 |
| Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic Grounding | Sep 1, 2023 | Graph Generation, Image Captioning | Code Available | 0 |
| Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception | Aug 31, 2023 | Activity Recognition, Human Activity Recognition | Unverified | 0 |
| Separate and Locate: Rethink the Text in Text-based Visual Question Answering | Aug 31, 2023 | Optical Character Recognition (OCR), Position | Code Available | 0 |
| UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory | Aug 28, 2023 | Question Answering, Retrieval | Code Available | 1 |
| Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP | Aug 27, 2023 | Question Answering, Text Generation | Code Available | 1 |
| DLIP: Distilling Language-Image Pre-training | Aug 24, 2023 | Image Captioning, Image-text Retrieval | Unverified | 0 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Aug 24, 2023 | Chart Question Answering, FS-MEVQA | Code Available | 5 |
| InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 | Aug 23, 2023 | Instruction Following, Question Answering | Code Available | 1 |
| EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE | Aug 23, 2023 | Image-text matching, Image-text Retrieval | Unverified | 0 |
| VQA Therapy: Exploring Answer Differences by Visually Grounding Answers | Aug 21, 2023 | Question Answering, Visual Question Answering | Code Available | 0 |
| SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes | Aug 21, 2023 | Attribute, Question Answering | Unverified | 0 |
| Generic Attention-model Explainability by Weighted Relevance Accumulation | Aug 20, 2023 | Image Captioning, Question Answering | Unverified | 0 |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | Aug 20, 2023 | Visual Question Answering | Code Available | 1 |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Aug 19, 2023 | MME, Optical Character Recognition (OCR) | Code Available | 2 |
| Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models | Aug 18, 2023 | Image-text matching, Object Localization | Unverified | 0 |
| Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks | Aug 17, 2023 | Question Answering, Text Generation | Code Available | 1 |
| Learning the meanings of function words from grounded language using a visual question answering model | Aug 16, 2023 | Logical Reasoning, Question Answering | Code Available | 0 |
| Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection | Aug 16, 2023 | Image Captioning, Language Modeling | Code Available | 1 |
| TeCH: Text-guided Reconstruction of Lifelike Clothed Humans | Aug 16, 2023 | Descriptive, Question Answering | Code Available | 2 |
| Foundation Model is Efficient Multimodal Multitask Model Selector | Aug 11, 2023 | model, Model Selection | Code Available | 1 |
| Detecting and Preventing Hallucinations in Large Vision Language Models | Aug 11, 2023 | 16k, Hallucination | Code Available | 1 |
| Progressive Spatio-temporal Perception for Audio-Visual Question Answering | Aug 10, 2023 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 |
| TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models | Aug 7, 2023 | backdoor defense, object-detection | Code Available | 0 |
| SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | Aug 7, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data | Aug 4, 2023 | Question Answering, Visual Question Answering | Code Available | 2 |
| RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic | Aug 3, 2023 | Chart Question Answering, Formal Logic | Code Available | 0 |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | Aug 2, 2023 | Visual Question Answering, Visual Question Answering (VQA) | Code Available | 4 |
| ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders | Aug 2, 2023 | Contrastive Learning, Question Answering | Unverified | 0 |
| Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering | Jul 28, 2023 | Question Answering, Visual Question Answering | Code Available | 0 |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | Jul 28, 2023 | Object, Question Answering | Code Available | 2 |
| BARTPhoBEiT: Pre-trained Sequence-to-Sequence and Image Transformers Models for Vietnamese Visual Question Answering | Jul 28, 2023 | Question Answering, Vietnamese Visual Question Answering | Unverified | 0 |
| Med-Flamingo: a Multimodal Medical Few-shot Learner | Jul 27, 2023 | Medical Visual Question Answering, Question Answering | Code Available | 2 |
| LOIS: Looking Out of Instance Semantics for Visual Question Answering | Jul 26, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering | Jul 22, 2023 | Graph Representation Learning, Language Modeling | Code Available | 1 |
| Robust Visual Question Answering: Datasets, Methods, and Future Challenges | Jul 21, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |