SOTAVerified

TGIF-Frame

Papers

Showing 1–10 of 15 papers

| Title | Status | Hype |
| --- | --- | --- |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | Code | 1 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | Code | 1 |
| Lightweight Recurrent Cross-modal Encoder for Video Question Answering | Code | 0 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | Code | 1 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Code | 2 |
| VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | — | 0 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Code | 2 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Code | 1 |
| MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | — | 0 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Code | 4 |

No leaderboard results yet.