| FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning | Apr 1, 2025 | Audio-visual Question AnsweringAudio-Visual Question Answering (AVQA) | CodeCode Available | 2 |
| CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | Mar 7, 2024 | Audio-visual Question AnsweringAudio-Visual Question Answering (AVQA) | CodeCode Available | 2 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | May 29, 2023 | Audio captioningAudio-Visual Captioning | CodeCode Available | 2 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioningAudio-Video Question Answering (AVQA) | CodeCode Available | 2 |
| PAVE: Patching and Adapting Video Large Language Models | Mar 25, 2025 | Audio-visual Question AnsweringMulti-Task Learning | CodeCode Available | 1 |
| Question-Aware Gaussian Experts for Audio-Visual Question Answering | Mar 6, 2025 | Audio-visual Question AnsweringAudio-Visual Question Answering (AVQA) | CodeCode Available | 1 |
| Boosting Audio Visual Question Answering via Key Semantic-Aware Cues | Jul 30, 2024 | Audio-visual Question AnsweringAudio-Visual Question Answering (AVQA) | CodeCode Available | 1 |
| Learning Trimodal Relation for AVQA with Missing Modality | Jul 23, 2024 | Audio-visual Question AnsweringAudio-Visual Question Answering (AVQA) | CodeCode Available | 1 |
| Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering | Apr 18, 2024 | Audio-visual Question AnsweringAudio-Visual Question Answering (AVQA) | CodeCode Available | 1 |
| Progressive Spatio-temporal Perception for Audio-Visual Question Answering | Aug 10, 2023 | Audio-visual Question AnsweringAudio-Visual Question Answering (AVQA) | CodeCode Available | 1 |