| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | May 29, 2023 | Audio captioning, Audio-Visual Captioning | Code Available | 2 |
| Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamic Audio-Visual Scenarios | May 21, 2023 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 0 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioning, Audio-Video Question Answering (AVQA) | Code Available | 2 |
| Vision Transformers are Parameter-Efficient Audio-Visual Learners | Dec 15, 2022 | Audio-visual Question Answering, AUDIO-VISUAL QUESTION ANSWERING (MUSIC-AVQA-v2.0) | Code Available | 1 |
| Learning to Answer Questions in Dynamic Audio-Visual Scenarios | Mar 26, 2022 | audio-visual learning, Audio-visual Question Answering | Code Available | 1 |
| Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos | Oct 11, 2021 | Audio-visual Question Answering, Question Answering | Code Available | 1 |
| Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos | Jan 1, 2021 | Audio-visual Question Answering, Question Answering | Code Available | 1 |