SOTAVerified

Audio-visual Question Answering

Papers

Showing 1-27 of 27 papers

Title | Status | Hype
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning | Code | 2
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | Code | 2
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Code | 2
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Code | 2
PAVE: Patching and Adapting Video Large Language Models | Code | 1
Question-Aware Gaussian Experts for Audio-Visual Question Answering | Code | 1
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues | Code | 1
Learning Trimodal Relation for AVQA with Missing Modality | Code | 1
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering | Code | 1
Progressive Spatio-temporal Perception for Audio-Visual Question Answering | Code | 1
Vision Transformers are Parameter-Efficient Audio-Visual Learners | Code | 1
Learning to Answer Questions in Dynamic Audio-Visual Scenarios | Code | 1
Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos | Code | 1
Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos | Code | 1
Learning Sparsity for Effective and Efficient Music Performance Question Answering | - | 0
Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs | Code | 0
AVQACL: A Novel Benchmark for Audio-Visual Question Answering Continual Learning | Code | 0
Patch-level Sounding Object Tracking for Audio-Visual Question Answering | - | 0
SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering | - | 0
OMCAT: Omni Context Aware Transformer | - | 0
SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering | - | 0
Towards Multilingual Audio-Visual Question Answering | Code | 0
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering | - | 0
Answering Diverse Questions via Text Attached with Key Audio-Visual Clues | Code | 0
Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering | Code | 0
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA | - | 0
Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios | Code | 0

Benchmark Results

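# | Model | Metric | Claimed | Verified | Status
1 | VAST | Acc | 80.7 | - | Unverified
2 | CoQo(Internvideo2) | Acc | 79.6 | - | Unverified
3 | VALOR | Acc | 78.9 | - | Unverified
4 | CAD | Acc | 78.26 | - | Unverified
5 | LAVISH | Acc | 77.08 | - | Unverified
6 | ST-AVQA | Acc | 71.52 | - | Unverified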