SOTAVerified

Audio-visual Question Answering

Papers

Showing 21-27 of 27 papers

| Title | Status | Hype |
| --- | --- | --- |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Code | 2 |
| Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios | Code | 0 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Code | 2 |
| Vision Transformers are Parameter-Efficient Audio-Visual Learners | Code | 1 |
| Learning to Answer Questions in Dynamic Audio-Visual Scenarios | Code | 1 |
| Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos | Code | 1 |
| Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos | Code | 1 |

Benchmark Results

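| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | VAST | Acc | 80.7 | - | Unverified |
| 2 | CoQo(Internvideo2) | Acc | 79.6 | - | Unverified |
| 3 | VALOR | Acc | 78.9 | - | Unverified |
| 4 | CAD | Acc | 78.26 | - | Unverified |
| 5 | LAVISH | Acc | 77.08 | - | Unverified |
| 6 | ST-AVQA | Acc | 71.52 | - | Unverified |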