| Latent Alignment and Variational Attention | Jul 10, 2018 | Hard Attention, Machine Translation | Code Available | 0 | 5 |
| Answer Questions with Right Image Regions: A Visual Attention Regularization Approach | Feb 3, 2021 | Question Answering, Visual Grounding | Code Available | 0 | 5 |
| CAST: Cross-modal Alignment Similarity Test for Vision Language Models | Sep 17, 2024 | cross-modal alignment, Question Answering | Code Available | 0 | 5 |
| Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens | Jun 19, 2024 | Caption Generation, image-classification | Code Available | 0 | 5 |
| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Apr 20, 2025 | Autonomous Driving, Image Captioning | Code Available | 0 | 5 |
| Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation | Jun 27, 2024 | Continual Learning, Question Answering | Code Available | 0 | 5 |
| LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | Apr 18, 2022 | cross-modal alignment, Document AI | Code Available | 0 | 5 |
| Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data | Apr 7, 2025 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| Cascaded Mutual Modulation for Visual Reasoning | Sep 6, 2018 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning | Nov 17, 2024 | Image Captioning, Language Modeling | Code Available | 0 | 5 |
| Answer Them All! Toward Universal Visual Question Answering Models | Mar 1, 2019 | All, Question Answering | Code Available | 0 | 5 |
| MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models | Dec 31, 2024 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning | Apr 1, 2024 | Image Captioning, Instruction Following | Code Available | 0 | 5 |
| MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | Mar 29, 2023 | Cross-Modal Retrieval, Decoder | Code Available | 0 | 5 |
| Visual Question Answering: which investigated applications? | Mar 4, 2021 | Image Captioning, Question Answering | Code Available | 0 | 5 |
| End-to-End Instance Segmentation with Recurrent Attention | May 30, 2016 | Autonomous Driving, Image Captioning | Code Available | 0 | 5 |
| End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features | Jun 21, 2018 | Question Answering, Video Description | Code Available | 0 | 5 |
| LPF: A Language-Prior Feedback Objective Function for De-biased Visual Question Answering | May 29, 2021 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| LXMERT Model Compression for Visual Question Answering | Oct 23, 2023 | model, Model Compression | Code Available | 0 | 5 |
| Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | Dec 2, 2016 | Visual Question Answering, Visual Question Answering (VQA) | Code Available | 0 | 5 |
| Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding | Mar 18, 2025 | document understanding, Question Answering | Code Available | 0 | 5 |
| Logical Implications for Visual Question Answering Consistency | Mar 16, 2023 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| Locally Smoothed Neural Networks | Nov 22, 2017 | Face Verification, Question Answering | Code Available | 0 | 5 |
| LLM-Assisted Multi-Teacher Continual Learning for Visual Question Answering in Robotic Surgery | Feb 26, 2024 | Continual Learning, Exemplar-Free | Code Available | 0 | 5 |
| Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View | Oct 30, 2020 | Face Recognition, image-classification | Code Available | 0 | 5 |