| FEDMEKI: A Benchmark for Scaling Medical Foundation Models via Federated Knowledge Injection | Aug 17, 2024 | Federated Learning, Medical Visual Question Answering | Code Available | 0 |
| Federated Document Visual Question Answering: A Pilot Study | May 10, 2024 | Federated Learning, Question Answering | Code Available | 0 |
| OG-SGG: Ontology-Guided Scene Graph Generation. A Case Study in Transfer Learning for Telepresence Robotics | Feb 21, 2022 | BIG-bench Machine Learning, Graph Generation | Code Available | 0 |
| Core Tokensets for Data-efficient Sequential Training of Transformers | Oct 8, 2024 | Image Captioning, Image Classification | Code Available | 0 |
| Copy-Move Forgery Detection and Question Answering for Remote Sensing Image | Dec 3, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism | Apr 29, 2024 | Document Understanding, GPU | Code Available | 0 |
| OmniFusion Technical Report | Apr 9, 2024 | MM-Vet, TextVQA | Code Available | 0 |
| Multimodal Residual Learning for Visual QA | Jun 5, 2016 | Multiple-choice, Question Answering | Code Available | 0 |
| OmniNet: A unified architecture for multi-modal multi-task learning | Jul 17, 2019 | Image Captioning, Multi-Task Learning | Code Available | 0 |
| Convincing Rationales for Visual Question Answering Reasoning | Feb 6, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss | May 5, 2021 | Question Answering, Visual Question Answering | Code Available | 0 |
| Multimodal Preference Data Synthetic Alignment with Reward Model | Dec 23, 2024 | 2k, Caption Generation | Code Available | 0 |
| Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond | Oct 8, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant | Oct 24, 2024 | Entity Linking, Question Answering | Code Available | 0 |
| Factor Graph Attention | Apr 11, 2019 | Graph Attention, Question Answering | Code Available | 0 |
| Continual VQA for Disaster Response Systems | Sep 21, 2022 | Disaster Response, Management | Code Available | 0 |
| On Modality Bias Recognition and Reduction | Feb 25, 2022 | Action Recognition, Multi-modal Classification | Code Available | 0 |
| Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language | Jan 1, 2023 | Question Answering, Self-Supervised Learning | Code Available | 0 |
| Answering Diverse Questions via Text Attached with Key Audio-Visual Clues | Mar 11, 2024 | Audio-Visual Question Answering (AVQA) | Code Available | 0 |
| Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering | Jul 28, 2023 | Question Answering, Visual Question Answering | Code Available | 0 |
| Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos | Sep 21, 2022 | Action Detection, Action Recognition | Code Available | 0 |
| Contextual Dropout: An Efficient Sample-Dependent Dropout Module | Mar 6, 2021 | Image Classification | Code Available | 0 |
| Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering | Aug 4, 2017 | Question Answering, Visual Question Answering | Code Available | 0 |
| Consistency of Compositional Generalization across Multiple Levels | Dec 18, 2024 | Meta-Learning, Question Answering | Code Available | 0 |
| What's Different between Visual Question Answering for Machine "Understanding" Versus for Accessibility? | Oct 26, 2022 | Benchmarking, Question Answering | Code Available | 0 |
| Synthesizing Sentiment-Controlled Feedback For Multimodal Text and Image Data | Feb 12, 2024 | Decoder, Marketing | Code Available | 0 |
| Synthetic Document Question Answering in Hungarian | May 29, 2025 | Optical Character Recognition (OCR), Question Answering | Code Available | 0 |
| Multimodal Explanations: Justifying Decisions and Pointing to the Evidence | Feb 15, 2018 | Activity Recognition, Explainable Models | Code Available | 0 |
| Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding | Jun 6, 2016 | Phrase Grounding, Visual Grounding | Code Available | 0 |
| Language Models Meet Anomaly Detection for Better Interpretability and Generalizability | Apr 11, 2024 | Anomaly Detection, Language Modelling | Code Available | 0 |
| Open-Ended Visual Question-Answering | Oct 9, 2016 | Question Answering, Sentence | Code Available | 0 |
| ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images | Apr 16, 2024 | Multimodal Deep Learning, Optical Character Recognition (OCR) | Code Available | 0 |
| Multi-Image Visual Question Answering | Dec 27, 2021 | Question Answering, Visual Question Answering | Code Available | 0 |
| MQA: Answering the Question via Robotic Manipulation | Mar 10, 2020 | Imitation Learning, Question Answering | Code Available | 0 |
| Open-Set Knowledge-Based Visual Question Answering with Inference Paths | Oct 12, 2023 | Knowledge Graphs, Multi-class Classification | Code Available | 0 |
| OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese | May 7, 2023 | Information Retrieval, Question Answering | Code Available | 0 |
| T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation | Mar 14, 2025 | Attribute, Question Answering | Code Available | 0 |
| Examining Gender and Racial Bias in Large Vision-Language Models Using a Novel Dataset of Parallel Images | Feb 8, 2024 | Image Captioning, Question Answering | Code Available | 0 |
| Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts | Jun 25, 2024 | Fairness, Question Answering | Code Available | 0 |
| TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines | Oct 31, 2019 | Attribute, Question Answering | Code Available | 0 |
| TAB-VCR: Tags and Attributes based VCR Baselines | Dec 1, 2019 | Attribute, Question Answering | Code Available | 0 |
| TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter | Jun 22, 2023 | Question Answering, Retrieval | Code Available | 0 |
| OsmLocator: locating overlapping scatter marks with a non-training generative perspective | Dec 18, 2023 | Clustering, Combinatorial Optimization | Code Available | 0 |
| Modulating early visual processing by language | Jul 2, 2017 | Question Answering, Visual Question Answering | Code Available | 0 |
| Modularized Zero-shot VQA with Pre-trained Models | May 27, 2023 | Object Detection | Code Available | 0 |
| What's in a Question: Using Visual Questions as a Form of Supervision | Apr 12, 2017 | Data Augmentation, Form | Code Available | 0 |
| Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos | Jun 11, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering | May 26, 2025 | Continual Learning, Question Answering | Code Available | 0 |
| Evaluating Attribute Comprehension in Large Vision-Language Models | Aug 25, 2024 | Attribute, Image-text matching | Code Available | 0 |
| Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering | Oct 26, 2022 | Question Answering, Visual Question Answering | Code Available | 0 |